Importance of spatial predictor variable selection in machine learning applications -- Moving from data reproduction to spatial prediction (1908.07805v1)

Published 21 Aug 2019 in stat.AP, cs.LG, and stat.ML

Abstract: Machine learning algorithms find frequent application in spatial prediction of biotic and abiotic environmental variables. However, the characteristics of spatial data, especially spatial autocorrelation, are widely ignored. We hypothesize that this is problematic and results in models that can reproduce training data but are unable to make spatial predictions beyond the locations of the training samples. We assume that not only spatial validation strategies but also spatial variable selection is essential for reliable spatial predictions. We introduce two case studies that use remote sensing to predict land cover and the leaf area index for the "Marburg Open Forest", an open research and education site of Marburg University, Germany. We use the machine learning algorithm Random Forests to train models using non-spatial and spatial cross-validation strategies to understand how spatial variable selection affects the predictions. Our findings confirm that spatial cross-validation is essential in preventing overoptimistic model performance. We further show that highly autocorrelated predictors (such as geolocation variables, e.g. latitude, longitude) can lead to considerable overfitting and result in models that can reproduce the training data but fail in making spatial predictions. The problem becomes apparent in the visual assessment of the spatial predictions that show clear artefacts that can be traced back to a misinterpretation of the spatially autocorrelated predictors by the algorithm. Spatial variable selection could automatically detect and remove such variables that lead to overfitting, resulting in reliable spatial prediction patterns and improved statistical spatial model performance. We conclude that in addition to spatial validation, a spatial variable selection must be considered in spatial predictions of ecological data to produce reliable predictions.

Citations (266)

Summary

The paper demonstrates that spatial variable selection improves prediction accuracy by mitigating overfitting caused by spatial autocorrelation.
The paper shows that using spatial cross-validation with Random Forest algorithms yields more reliable predictions in ecological case studies.
The paper advocates for omitting geolocation variables via forward feature selection to prevent misleading importance scores and overfitting.

Importance of Spatial Predictor Variable Selection in Machine Learning Applications

The paper investigates the role of spatial predictor variable selection in improving the robustness of machine learning models, particularly for spatial prediction tasks in ecology. The authors posit that traditional machine learning approaches, which often neglect the spatial characteristics of data, result in models that perform well at reproducing training data but fail to make reliable spatial predictions beyond those locations. The paper is significant because it underscores the importance of both spatial validation strategies and spatial variable selection in ecological modeling, aiming to move from simple data reproduction to accurate spatial prediction.

The research is structured around two case studies that use Random Forest models for spatial prediction: land use/land cover (LULC) classification and Leaf Area Index (LAI) modeling in the "Marburg Open Forest" in Germany. Each case study compares spatial and non-spatial cross-validation strategies to evaluate how spatial variable selection affects prediction accuracy.
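
The difference between spatial and non-spatial cross-validation can be sketched in a few lines. The example below is a minimal illustration, assuming square grid blocks as the spatial unit; the function names, the block size, and the randomly generated sample coordinates are all hypothetical, and the folds used in the paper itself may be defined differently. In the spatial variant, every sample in a block is held out together, so a test sample never has a training neighbour from its own block.

```python
import random
from collections import defaultdict


def random_folds(n, k, seed=0):
    """Non-spatial k-fold split: samples are assigned to folds at random."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    for fold in range(k):
        test_idx = sorted(idx[fold::k])
        test_set = set(test_idx)
        yield [i for i in range(n) if i not in test_set], test_idx


def block_of(point, block_size):
    """Grid block (column, row) that a coordinate pair falls into."""
    x, y = point
    return (int(x // block_size), int(y // block_size))


def spatial_block_folds(coords, block_size, k, seed=0):
    """Spatial k-fold split: whole grid blocks are held out together."""
    blocks = defaultdict(list)
    for i, point in enumerate(coords):
        blocks[block_of(point, block_size)].append(i)
    block_ids = sorted(blocks)
    random.Random(seed).shuffle(block_ids)
    for fold in range(k):
        test_idx = sorted(i for b in block_ids[fold::k] for i in blocks[b])
        test_set = set(test_idx)
        train_idx = [i for i in range(len(coords)) if i not in test_set]
        yield train_idx, test_idx


# Hypothetical sample locations in a 1000 x 1000 unit study area.
rng = random.Random(42)
coords = [(rng.uniform(0, 1000), rng.uniform(0, 1000)) for _ in range(200)]

spatial = list(spatial_block_folds(coords, block_size=250, k=4))
non_spatial = list(random_folds(len(coords), k=4))
```

Either generator can feed the same model-fitting loop; only the way samples are grouped into folds changes, which is what makes the two error estimates comparable.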

Key Findings and Contributions

Spatial Cross-validation Necessity: The paper confirms the importance of spatial cross-validation in avoiding overoptimistic evaluations of model performance. Random cross-validation strategies fail to account for spatial dependencies and, thus, do not provide reliable measures of a model’s performance in unseen spatial domains.
Overfitting Due to Spatial Autocorrelation: The paper highlights the overfitting induced by highly autocorrelated predictors, such as geolocation variables. These variables may appear informative within the training dataset but do not generalize to locations outside it, leading to models that reproduce the training data rather than predict new areas.
Spatial Variable Selection: Through Forward Feature Selection (FFS) in combination with spatial cross-validation, the research advocates for automatic detection and exclusion of variables that may cause overfitting. The empirical results indicate that models perform better spatially when predictors like latitude and longitude—which cause spatial artifacts—are removed.
Variable Importance: Variables such as geolocation often dominate variable importance metrics even though they lead to overfitting, so they should be included in spatial prediction models only with caution. Recursive Feature Elimination (RFE), often employed in feature selection, is shown to be inadequate without accompanying spatial validation because it relies on these misleading importance scores.
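
The combination of forward feature selection and spatial cross-validation described above can be sketched as a greedy loop. This is a simplified illustration, not the authors' implementation: it swaps Random Forest for a 1-nearest-neighbour predictor so that it runs with the standard library only, it starts from an empty predictor set, and the toy data (four spatial clusters, a predictor called `signal`, and `lon`/`lat` coordinates) are hypothetical. Each candidate predictor is scored by the error obtained when whole clusters are held out, and a predictor is added only if it lowers that error.

```python
import math


def nn_predict(train_rows, train_y, row):
    """Return the target of the nearest training row (1-nearest-neighbour)."""
    best = min(range(len(train_rows)), key=lambda i: math.dist(train_rows[i], row))
    return train_y[best]


def spatial_cv_rmse(X, y, groups, cols):
    """Pooled RMSE when each spatial group is held out in turn."""
    sq_err = 0.0
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        train_rows = [[X[i][c] for c in cols] for i in train]
        train_y = [y[i] for i in train]
        for i in test:
            pred = nn_predict(train_rows, train_y, [X[i][c] for c in cols])
            sq_err += (pred - y[i]) ** 2
    return math.sqrt(sq_err / len(y))


def forward_feature_selection(X, y, groups, names):
    """Greedily add the predictor that most reduces the spatial CV error."""
    selected, best_err = [], math.inf
    while len(selected) < len(names):
        trials = {
            n: spatial_cv_rmse(X, y, groups, [names.index(m) for m in selected + [n]])
            for n in names
            if n not in selected
        }
        candidate = min(trials, key=trials.get)
        if trials[candidate] >= best_err:
            break  # no remaining predictor improves the spatial error
        selected.append(candidate)
        best_err = trials[candidate]
    return selected, best_err


# Hypothetical toy data: the target depends on "signal" only, while the
# coordinates merely identify which cluster a sample belongs to.
names = ["signal", "lon", "lat"]
X, y, groups = [], [], []
for c in range(4):          # four spatial clusters
    for j in range(5):      # five samples per cluster
        signal = 4 * j + c
        X.append([signal, 100 * c + j, 100 * (3 - c) + j])
        y.append(2 * signal)
        groups.append(c)

selected, err = forward_feature_selection(X, y, groups, names)
print(selected, err)  # ['signal'] 2.0
```

On this toy data the loop keeps only `signal`: adding either coordinate pulls the nearest neighbour toward the adjacent cluster regardless of its `signal` value, which raises the held-out error, so both coordinates are rejected.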

Implications and Future Directions

The implications of this paper stretch beyond ecological data modeling to any domain engaging in spatial prediction tasks. By demonstrating that spatial characteristics need explicit attention, the authors call for a paradigm shift in data processing and model evaluation in spatial contexts. The consideration of spatial dependency should permeate all stages of modeling, from data preparation to validation and interpretation of results.

Future research should explore developing algorithms or frameworks that inherently account for spatial dependencies, potentially integrating them more seamlessly into the model training processes. Furthermore, expanding these methodologies to other ecological and environmental prediction tasks will test the generality and adaptability of the proposed solutions. The paper opens avenues to enrich predictive modeling with spatial awareness, enhancing the reliability of models deployed in real-world scenarios.

The paper presents a nuanced approach to handling spatial data in machine learning, advocating for strategies that consider spatial dependencies as integral rather than supplementary. By rigorously applying spatial validation and variable selection, it is possible to enhance the validity of spatial predictions across various ecological applications.
