On the use of adversarial validation for quantifying dissimilarity in geospatial machine learning prediction (2404.12575v1)
Abstract: Recent geospatial machine learning studies have shown that the results of model evaluation via cross-validation (CV) are strongly affected by the dissimilarity between the sample data and the prediction locations. In this paper, we propose a method that quantifies this dissimilarity on a scale from 0 to 100%, from the perspective of the data feature space. The proposed method is based on adversarial validation, an approach that checks whether sample data and prediction locations can be separated by a binary classifier. To study the effectiveness and generality of our method, we tested it in a series of experiments on both synthetic and real datasets with gradually increasing dissimilarities. The results show that the proposed method successfully quantifies dissimilarity across the entire range of values. In addition, we studied how dissimilarity affects CV evaluations by comparing the results of random CV with those of two spatial CV methods, namely block CV and spatial+ CV. CV evaluations follow similar patterns in all datasets and predictions: when dissimilarity is low (usually below 30%), random CV provides the most accurate evaluation results. As dissimilarity increases, spatial CV methods, especially spatial+ CV, become increasingly accurate and even outperform random CV. When dissimilarity is high (>=90%), no CV method provides accurate evaluations. These results show the importance of considering feature space dissimilarity when working with geospatial machine learning predictions, and they can help researchers and practitioners select more suitable CV methods for evaluating their predictions.
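The adversarial validation idea described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: the binary classifier is a plain logistic regression written with the standard library only, separability is measured by the held-out AUC, and the mapping to a 0 to 100% score is taken to be `100 * max(0, 2 * AUC - 1)`. The function names (`dissimilarity`, `gaussian_cloud`, and so on) are hypothetical, and the classifier and index used by the authors may differ.

```python
import math
import random


def predict(w, b, x):
    """Logistic probability that feature vector x belongs to the prediction set."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, epochs=300, lr=0.1):
    """Fit logistic regression by batch gradient descent; returns (weights, bias).

    Assumes features are roughly on a unit scale so the fixed step size is stable.
    """
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        b -= lr * grad_b / len(X)
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
    return w, b


def auc(scores, labels):
    """Area under the ROC curve, computed as the Mann-Whitney rank statistic."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def dissimilarity(sample, prediction, seed=0):
    """Score in [0, 100]: how well a classifier separates the two feature sets.

    The 2 * AUC - 1 rescaling is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    # Label sample data 0 and prediction locations 1, then split train/test.
    rows = [(x, 0) for x in sample] + [(x, 1) for x in prediction]
    rng.shuffle(rows)
    half = len(rows) // 2
    train, test = rows[:half], rows[half:]
    w, b = fit_logistic([x for x, _ in train], [y for _, y in train])
    scores = [predict(w, b, x) for x, _ in test]
    return 100.0 * max(0.0, 2.0 * auc(scores, [y for _, y in test]) - 1.0)


def gaussian_cloud(n, shift, rng):
    """Synthetic two-feature data centred at (shift, shift) with unit variance."""
    return [[rng.gauss(shift, 1.0), rng.gauss(shift, 1.0)] for _ in range(n)]


rng = random.Random(42)
sample = gaussian_cloud(200, 0.0, rng)
low = dissimilarity(sample, gaussian_cloud(200, 0.0, rng))   # same distribution
high = dissimilarity(sample, gaussian_cloud(200, 4.0, rng))  # strongly shifted
print(f"similar features: {low:.1f}%  shifted features: {high:.1f}%")
```

When both sets are drawn from the same distribution, the classifier cannot do much better than chance and the score stays near the bottom of the range; when the prediction features are strongly shifted, the two sets are almost perfectly separable and the score approaches 100%. A linear classifier only detects shifts that are linearly separable, so a more flexible classifier would be needed for other kinds of feature space differences.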