Predicting into unknown space? Estimating the area of applicability of spatial prediction models (2005.07939v1)

Published 16 May 2020 in stat.ML and cs.LG

Abstract: Predictive modelling using machine learning has become very popular for spatial mapping of the environment. Models are often applied to make predictions far beyond sampling locations where new geographic locations might considerably differ from the training data in their environmental properties. However, areas in the predictor space without support of training data are problematic. Since the model has no knowledge about these environments, predictions have to be considered uncertain. Estimating the area to which a prediction model can be reliably applied is required. Here, we suggest a methodology that delineates the "area of applicability" (AOA) that we define as the area, for which the cross-validation error of the model applies. We first propose a "dissimilarity index" (DI) that is based on the minimum distance to the training data in the predictor space, with predictors being weighted by their respective importance in the model. The AOA is then derived by applying a threshold based on the DI of the training data where the DI is calculated with respect to the cross-validation strategy used for model training. We test for the ideal threshold by using simulated data and compare the prediction error within the AOA with the cross-validation error of the model. We illustrate the approach using a simulated case study. Our simulation study suggests a threshold on DI to define the AOA at the .95 quantile of the DI in the training data. Using this threshold, the prediction error within the AOA is comparable to the cross-validation RMSE of the model, while the cross-validation error does not apply outside the AOA. This applies to models being trained with randomly distributed training data, as well as when training data are clustered in space and where spatial cross-validation is applied. We suggest to report the AOA alongside predictions, complementary to validation measures.

Citations (209)

Summary

  • The paper introduces the Area of Applicability (AOA) for spatial prediction models using a Dissimilarity Index (DI) to identify regions where predictions may be unreliable outside training data.
  • The methodology calculates the DI based on predictor distance to training data, weighted by importance, and uses the 0.95 quantile of training DI values as a threshold to define the AOA boundary.
  • Extensive simulations demonstrate that predictions within the AOA show performance consistent with cross-validation, while those outside exhibit significantly higher errors, highlighting the AOA's value for communicating prediction uncertainty and guiding sampling.

Estimating Areas of Applicability in Spatial Prediction Models

The paper "Predicting into unknown space? Estimating the area of applicability of spatial prediction models" by Meyer and Pebesma addresses the significant challenge of determining the reliability of spatial prediction models, particularly when predicting outside the range of the initial training data. The authors introduce the concept of an "Area of Applicability" (AOA) to specify the spatial regions where model predictions can be deemed reliable based on the match between new environmental conditions and the conditions represented in the training data.

The core contribution of this paper is a methodology for delineating the AOA using a measure the authors call the "Dissimilarity Index" (DI). The DI is based on the distance in predictor space between a new prediction location and the nearest training data point, with predictors first weighted by their respective importance in the model (derived, in the paper's examples, from Random Forest variable importance). These distances are then normalized by the average dissimilarity within the training data itself, yielding a standardized measure that flags areas where predictions fall outside the AOA because the environmental conditions are not represented in the training dataset.
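The procedure can be sketched in a few lines of NumPy. This is a minimal illustration under simplifying assumptions, not the paper's reference implementation: in particular, it uses leave-one-out nearest-neighbour distances for the training DI, whereas the paper computes the training DI with respect to the cross-validation folds used for model training.

```python
import numpy as np

def aoa(train, new, weights, quantile=0.95):
    """Sketch of the DI/AOA idea.
    train: (n, p) training predictors; new: (m, p) prediction locations;
    weights: (p,) variable importance scores (e.g. from a Random Forest)."""
    # Standardize by training statistics, then weight by importance
    mu, sd = train.mean(axis=0), train.std(axis=0)
    tr = (train - mu) / sd * weights
    nw = (new - mu) / sd * weights

    # Average pairwise distance within the training data (normalization)
    d_tr = np.linalg.norm(tr[:, None, :] - tr[None, :, :], axis=2)
    d_bar = d_tr[np.triu_indices(len(tr), k=1)].mean()

    # Training DI: leave-one-out nearest-neighbour distance
    # (simplification; the paper respects the CV folds here)
    np.fill_diagonal(d_tr, np.inf)
    di_train = d_tr.min(axis=1) / d_bar

    # DI of new locations: distance to the nearest training point
    d_new = np.linalg.norm(nw[:, None, :] - tr[None, :, :], axis=2)
    di_new = d_new.min(axis=1) / d_bar

    # AOA: new locations whose DI does not exceed the 0.95 quantile
    # of the DI observed in the training data
    threshold = np.quantile(di_train, quantile)
    return di_new, di_new <= threshold
```

A location identical in its predictor values to a training point gets DI near zero and is inside the AOA; a location with predictor values far from anything sampled gets a large DI and is excluded.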

A significant part of the paper demonstrates the utility of a DI threshold, set at the 0.95 quantile of the DI values of the training data, as an effective boundary for the AOA. Through extensive simulation studies spanning 972 scenarios, covering both randomly distributed and spatially clustered training data, the authors show that with this threshold the RMSE of predictions within the AOA aligns closely with the cross-validation RMSE of the model. Conversely, predictions outside the AOA exhibit substantially higher errors, underscoring the AOA's role in communicating prediction uncertainty.
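The inside-versus-outside error gap can be reproduced on a toy problem. The sketch below is a hypothetical, heavily simplified stand-in for the paper's simulations: a 1-nearest-neighbour regressor plays the role of the learner, the single predictor needs no importance weighting, and the DI normalization constant is dropped because the threshold and the new-point DI share the same scale.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy setup: one predictor, nonlinear response y = x^2
x_train = rng.uniform(0, 1, 200)
y_train = x_train ** 2

def predict(x):
    # 1-nearest-neighbour prediction stands in for any black-box learner
    idx = np.abs(x[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[idx]

# Training DI: leave-one-out nearest-neighbour distance;
# threshold at the 0.95 quantile of these values
d = np.abs(x_train[:, None] - x_train[None, :])
np.fill_diagonal(d, np.inf)
threshold = np.quantile(d.min(axis=1), 0.95)

# New locations: 100 inside the sampled range, 100 far outside it
x_new = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(3, 4, 100)])
di_new = np.abs(x_new[:, None] - x_train[None, :]).min(axis=1)
inside = di_new <= threshold

# Error inside vs outside the AOA
err = predict(x_new) - x_new ** 2
rmse_in = np.sqrt((err[inside] ** 2).mean())
rmse_out = np.sqrt((err[~inside] ** 2).mean())
```

The extrapolated points (x in [3, 4]) all fall outside the AOA, and their RMSE dwarfs the within-AOA RMSE, mirroring the qualitative result of the paper's simulation study.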

The implications of this approach are substantial for spatial predictive modelling. It offers a way to quantify and communicate the reliability of predictions in spatially heterogeneous environments, where data-driven approaches such as Random Forests commonly break down when extrapolating into unknown predictor space. By proposing a standardized method for estimating the AOA, the authors give practitioners a tool to accompany traditional validation metrics, enabling a more informed application of prediction models in critical areas such as environmental management and policy-making.

Furthermore, the paper suggests that acknowledging the AOA could improve sampling strategies by directing efforts towards under-represented environmental conditions in training datasets, potentially increasing model reliability and applicable domain size. This proactive approach to model development and application highlights a pathway for future research to refine predictive models and expand their generalizability in complex landscapes.

Overall, the introduction of the AOA fills a crucial gap in machine learning-based spatial predictions by addressing the transferability and reliability of model outputs beyond training environments. The implications not only enhance theoretical understanding in AI and spatial statistics but also hold practical significance in fields that rely on robust prediction models for environmental assessments and decision-making. Future research could investigate the adaptation of the AOA concept to various machine learning algorithms beyond Random Forests and explore its application in diverse fields requiring spatial predictions.