Geospatial Samplers: Methods and Applications

Updated 10 March 2026

Geospatial samplers are statistical and algorithmic methods designed to extract spatially balanced samples, effectively managing spatial autocorrelation and dataset heterogeneity.
They are applied in diverse settings such as remote sensing patch extraction, finite-population survey sampling, scalable Bayesian modeling, and graph-based sequential sampling.
These methods enhance data inference and computational efficiency by leveraging tailored algorithms like RandomGeoSampler, HPWD, and Graph Spatial Sampling to optimize spatial spread.

Geospatial samplers are statistical and algorithmic methodologies designed to select spatially distributed samples or data patches from large geospatial datasets, with the objective of supporting inference, learning, or design-based estimation while explicitly controlling the spatial allocation and balance of the selected units or windows. These methods address challenges arising from spatial autocorrelation, high dimensionality, and heterogeneity in geographic data, and have become central both in classical survey sampling (finite-population context) and in the management of raster or vector-based remote sensing data streams encountered in modern deep learning pipelines.

1. Taxonomy of Geospatial Sampling Methods

Several broad classes of geospatial samplers are distinguished by their domain of application, sampling pattern, and mathematical principle:

Samplers for Patch Extraction in Remote Sensing: These operate over extremely large raster scenes (remotely sensed imagery) and select spatial windows for training or inference purposes in deep learning workflows. TorchGeo implements three samplers:
- RandomGeoSampler: Draws random spatial windows across all scenes with scene-selection probability proportional to spatial area.
- RandomBatchGeoSampler: Picks a random scene per batch and samples multiple random windows from it, maximizing file I/O locality.
- GridGeoSampler: Systematically tiles each scene along a fixed stride, with optional overlap, primarily for full-coverage prediction (Stewart et al., 2021).
Spatially Balanced Finite-Population Designs: Methods such as Heuristic PWD (HPWD), WAVE, and n-Means Spatial Sampling enforce spatial spread when selecting representative units from a finite sampling frame.
- HPWD: Sequentially reweights inclusion probabilities by a function of distance from previously selected units, enforcing strong spatial balance at O(nN) cost (Benedetti et al., 2017).
- WAVE: Iteratively adjusts probabilities so that local neighborhoods behave “repulsively” (weakly associated vectors), exactly attaining the inclusion constraints (Jauslin et al., 2019).
- Intelligent n-Means: Combines a translation-invariant spreadness index with UP-balanced k-means clustering and an intelligent search for ordering, optimizing both balance and spread (Panahbehagh et al., 2025).
Spatial Subsampling for Scalable Modeling: Samplers that facilitate scalable Bayesian spatial inference via data-subset models or gridding strategies.
- SDSM: Repeatedly selects small design-based subsamples for likelihood evaluation in spatial models, preserving moments and spatial dependence (Saha et al., 2023).
- DAG- or Mesh-Based Gridding: Partitions space into blocks or grid cells, exploiting induced conditional independence for fast MCMC or implementation on massive datasets (Peruzzi et al., 2021, Peruzzi et al., 2022).
Graph-Based and Sequential Algorithms:
- Graph Spatial Sampling (GSS): Designs spatial sampling on a graph via lagged Metropolis–Hastings walks, offering direct control over sampling probabilities and autocorrelation structure (Zhang, 2022).
- Sequential/Online Samplers: Make balanced, spatially spread sampling decisions as data stream in, using auxiliary variables and distance-aware constraints to dynamically reassign probabilities (Jauslin et al., 2021).

This taxonomy is summarized in the table below.

Class	Example Algorithms	Core Principle
Remote Sensing Patch	RandomGeoSampler, GridGeoSampler	Window-based, area-weighted or tiled patching
Finite-Pop Sampling	HPWD, WAVE, n-Means	Distance-based probability updates, spatial clustering
Scalable Model-based	SDSM, Grids, DAGs	Random or systematic subsets, induced DAG sparsity
Graph/Sequential	GSS, Sequential Balancing	Graph-walks, sequential optimization of balance/spread
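
The tiled patching principle listed for GridGeoSampler (fixed stride, optional overlap) can be illustrated with a minimal sketch. This is not TorchGeo's implementation: the function `grid_windows` and its arguments are hypothetical, and the rule that the last tile may extend past the scene edge is an assumption made here so that coverage is always complete.

```python
import math


def grid_windows(minx, miny, maxx, maxy, size, stride):
    """Return (x0, y0, x1, y1) tiles covering a scene extent.

    Hypothetical helper for illustration only. `size` and `stride` are in
    the same ground units as the extent; stride < size gives overlap.
    """
    # Number of tile positions per axis. The last tile may reach past the
    # extent when (extent - size) is not a multiple of the stride.
    cols = max(1, math.ceil((maxx - minx - size) / stride) + 1)
    rows = max(1, math.ceil((maxy - miny - size) / stride) + 1)
    tiles = []
    for i in range(rows):
        for j in range(cols):
            x0 = minx + j * stride
            y0 = miny + i * stride
            tiles.append((x0, y0, x0 + size, y0 + size))
    return tiles


# 100 x 100 scene, 40-unit tiles, 30-unit stride (10 units of overlap).
tiles = grid_windows(0, 0, 100, 100, size=40, stride=30)
print(len(tiles), tiles[0], tiles[-1])
```

Choosing a stride smaller than the tile size trades extra reads for overlap between neighboring predictions, which is why this pattern is mainly used for full-coverage inference.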

2. Mathematical Foundations and Algorithms

Geospatial samplers rely on explicit mathematical formulations to control inclusion probabilities, spatial dispersion, and efficiency.

Patch-based (TorchGeo): For a raster scene $s$ with extent $E_s$, a window of pixel size $(h, w)$ at ground resolution $r$ corresponds to

$q = (x, y, \Delta x, \Delta y), \quad \Delta x = w r, \quad \Delta y = h r$

RandomGeoSampler samples scene $s$ with probability proportional to area, then uniformly selects $x, y$ within feasible ranges (Stewart et al., 2021).
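
The two-stage draw described above (area-weighted scene choice, then a uniform window origin) can be sketched in plain Python. This is an illustration under stated assumptions, not TorchGeo's code: `random_window` is a hypothetical function, scenes are represented as simple `(minx, miny, maxx, maxy)` tuples in ground units, and scenes too small to hold a window are simply skipped.

```python
import random


def random_window(scenes, h, w, r, rng=random):
    """Draw one query q = (x, y, dx, dy) with dx = w * r and dy = h * r.

    Hypothetical sketch: `scenes` is a list of (minx, miny, maxx, maxy)
    extents, (h, w) is the window size in pixels, r the ground resolution.
    """
    dx, dy = w * r, h * r
    # Assumption: scenes that cannot contain a full window are excluded.
    feasible = [s for s in scenes if s[2] - s[0] >= dx and s[3] - s[1] >= dy]
    if not feasible:
        raise ValueError("no scene is large enough for the requested window")
    # Stage 1: pick a scene with probability proportional to its area.
    areas = [(s[2] - s[0]) * (s[3] - s[1]) for s in feasible]
    minx, miny, maxx, maxy = rng.choices(feasible, weights=areas, k=1)[0]
    # Stage 2: pick the window origin uniformly within the feasible range.
    x = rng.uniform(minx, maxx - dx)
    y = rng.uniform(miny, maxy - dy)
    return (x, y, dx, dy)


scenes = [(0, 0, 1000, 1000), (2000, 0, 2500, 500)]
print(random_window(scenes, h=64, w=64, r=2.0, rng=random.Random(0)))
```

Because the first scene has four times the area of the second, it is chosen about four times as often, so windows are spread roughly evenly over the total imaged area.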

HPWD (“Draw-by-Draw”):

$\pi_j^{(t)} = \frac{\pi_j^{(t-1)} \bar d_{i_{t-1},j}}{ \sum_{\ell \notin S_{t-1}} \pi_{\ell}^{(t-1)} \bar d_{i_{t-1},\ell} }$

This scheme iteratively down-weights probabilities according to standardized spatial distances, promoting repulsion among nearby units (Benedetti et al., 2017).
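
A minimal NumPy sketch of the draw-by-draw update follows. It is a simplified illustration rather than the published algorithm: `hpwd_sketch` is a hypothetical name, raw Euclidean distances stand in for the standardized distances $\bar d$ (a constant rescaling cancels in the normalization, but the paper's full standardization is not reproduced), and coordinates are assumed distinct.

```python
import numpy as np


def hpwd_sketch(coords, n, pi0=None, seed=None):
    """Select n units by repeatedly reweighting draw probabilities by distance.

    Hypothetical sketch of the update
        p_j <- p_j * d(i_prev, j) / sum_l p_l * d(i_prev, l).
    """
    rng = np.random.default_rng(seed)
    coords = np.asarray(coords, dtype=float)
    N = coords.shape[0]
    if pi0 is None:
        p = np.full(N, 1.0 / N)
    else:
        p = np.asarray(pi0, dtype=float)
        p = p / p.sum()
    selected = []
    for t in range(n):
        i = int(rng.choice(N, p=p))
        selected.append(i)
        if t == n - 1:
            break
        # Distances from the unit just drawn: O(N) work per draw, O(nN) total.
        dist = np.linalg.norm(coords - coords[i], axis=1)
        # d(i, i) = 0 removes the drawn unit; earlier draws already have p = 0,
        # so summing over all units equals summing over unselected ones.
        p = p * dist
        p = p / p.sum()
    return selected


pts = np.random.default_rng(0).uniform(size=(200, 2))
print(hpwd_sketch(pts, n=10, seed=1))
```

Computing distances only from the most recently drawn unit avoids storing the full $N \times N$ matrix, which is what keeps the total cost at $O(nN)$.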

WAVE (Weakly Associated Vector Sampling):

The updating vector $u$ is “weakly associated” if

$1^\top u = 0, \quad H^{(r)} u = 0$

where $H^{(r)}$ encodes the $r$ nearest neighbors of each unit; the updates induce negative covariance between the inclusion indicators of neighboring units (Jauslin et al., 2019).

Graph Spatial Sampling: Uses a lagged Metropolis–Hastings walk on a designed spatial graph $G=(V,E)$, adjusting backtracking and jump rates to control mixing and stationarity:

$P((i,h)\to(h,j)) = \frac{r}{d_h + r} u_j + \frac{d_h}{d_h + r} Q_{(i,h)\to(h,j)} \alpha(i\to j)$

with $Q$ and $\alpha$ as defined in the original paper (Zhang, 2022).

n-Means Spatial Sampling: Clusters the population $U$ into $n$ sets via UP-balancing, then employs ordering/search and bar-stacking to ensure exact $\pi$-control with optimal spread, measured by the Density Disparity Index (S-index) (Panahbehagh et al., 2025).

SDSM (Spatial Data Subset Model): At each MCMC iteration, selects a subset $\delta$ of size $m$, fits the likelihood to this subsample, and derives spatial moment corrections for sill, nugget, and range according to the underlying design (Saha et al., 2023).
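
The core computational idea of evaluating a spatial likelihood on a small subsample can be sketched as follows. This is only an illustration under simplifying assumptions, not the SDSM procedure itself: `subset_loglik` is a hypothetical function, the subsample is a simple random draw, the process is zero-mean Gaussian with an exponential covariance, and the design-based moment corrections described above are omitted.

```python
import numpy as np


def subset_loglik(coords, y, m, sill, range_, nugget, rng):
    """Gaussian log-likelihood evaluated on a random subset of size m.

    Hypothetical sketch: covariance is sill * exp(-d / range_) plus a
    nugget on the diagonal; the mean is assumed to be zero.
    """
    idx = rng.choice(len(y), size=m, replace=False)
    s, z = coords[idx], y[idx]
    # Pairwise distances within the subset only: an m x m matrix.
    d = np.sqrt(((s[:, None, :] - s[None, :, :]) ** 2).sum(axis=-1))
    K = sill * np.exp(-d / range_) + nugget * np.eye(m)
    # The Cholesky factorization is the O(m^3) step, independent of N.
    L = np.linalg.cholesky(K)
    a = np.linalg.solve(L, z)              # a = L^{-1} z, so a.a = z' K^{-1} z
    logdet = 2.0 * np.log(np.diag(L)).sum()
    ll = -0.5 * (a @ a + logdet + m * np.log(2.0 * np.pi))
    return ll, idx


rng = np.random.default_rng(0)
coords = rng.uniform(size=(5000, 2))
y = rng.normal(size=5000)
ll, idx = subset_loglik(coords, y, m=100, sill=1.0, range_=0.2, nugget=0.1, rng=rng)
print(ll)
```

In an MCMC loop a fresh subset would be drawn at each iteration, so the per-iteration cost depends on $m$ rather than on the full data size.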

3. Performance, Computational Complexity, and Scaling

Computational efficiency is central due to the high dimensionality and large scale of geospatial data:

TorchGeo Patch Sampling: GridGeoSampler achieves up to $200$ patches/sec (batch size 1) with maximal cache reuse, while random samplers reach $40$–$50$ patches/sec and degrade with larger batch sizes. Pre-warping boosts speed by up to $3\times$ at the cost of storage (Stewart et al., 2021).
HPWD: $O(nN)$ total cost, supporting large $N$ ($>10^4$). A C++ implementation takes about $0.2$ ms per draw for $n=600$, $N=3{,}000$. It approximates costly $O(N^2)$ or MCMC schemes with negligible loss of spatial balance (Benedetti et al., 2017).
WAVE: A naïve implementation costs $O(N^2)$; practical implementations using spatial indexing (kd-trees) achieve $O(N\log N)$. Empirically, it matches or exceeds GRTS and LPM on spatial balance and mean squared error (Jauslin et al., 2019).
n-Means/GMS: Clustering and ordering scale as $O((n/\delta)\, n\, T)$, where $\delta$ is the split size and $T$ is the cost of one k-means iteration. The cost of Hungarian matching is negligible for $n < 200$ (Panahbehagh et al., 2025).
Graph Spatial Sampling: Mixing time scales as $O(1/(1-\lambda_2))$, where $\lambda_2$ is the second-largest eigenvalue of the transition matrix of the walk. Design efficiency can be up to $10\times$ higher than LPM/GRTS for smooth spatial trends (Zhang, 2022).
Mesh-based Model Sampling: Gridding and DAG approaches reduce cubic GP complexity ($O(n^3)$) to $O(n)$ per iteration by exploiting Markov structure and coloring, as in GriPS and SiMPA (Peruzzi et al., 2021, Peruzzi et al., 2022).
SDSM: MCMC cost per iteration is $O(m^3)$, independent of $N$, enabling scalable inference for $N\to 10^7$ provided the subset size $m$ is small; predictive errors decrease rapidly as $m$ increases, but at cubic cost in $m$ (Saha et al., 2023).
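
The $1/(1-\lambda_2)$ scaling quoted for Graph Spatial Sampling can be made concrete with a small numerical sketch. As a stand-in for the lagged Metropolis–Hastings walk, the example assumes a simple random walk on a connected, undirected graph; `relaxation_time` and `cycle_graph` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np


def relaxation_time(adj):
    """Return 1 / (1 - lambda_2) for the simple random walk on a graph.

    Assumes a connected, undirected graph given as a symmetric 0/1
    adjacency matrix. The walk P = D^{-1} A has the same eigenvalues as
    the symmetric matrix D^{-1/2} A D^{-1/2}, which is used here.
    """
    A = np.asarray(adj, dtype=float)
    deg = A.sum(axis=1)
    S = A / np.sqrt(np.outer(deg, deg))
    eig = np.linalg.eigvalsh(S)   # real eigenvalues in ascending order
    lam2 = eig[-2]                # second-largest eigenvalue
    return 1.0 / (1.0 - lam2)


def cycle_graph(n):
    """Adjacency matrix of a ring of n nodes (hypothetical test graph)."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
    return A


complete = np.ones((8, 8)) - np.eye(8)
print(relaxation_time(cycle_graph(8)), relaxation_time(complete))
```

The ring, where information can only spread to adjacent nodes, has a much smaller spectral gap than the complete graph, which is the effect that long-range jumps in a designed graph are meant to counteract.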

4. Evaluation, Empirical Benchmarks, and Comparative Analysis

Comparative studies and simulations highlight the empirical efficacy and domain of each sampler:

HPWD: Matches or exceeds competing designs on spatial balance indices and achieves RMSE reductions of $40$–$60\%$ over SRS in autocorrelated scenarios. On the LUCAS and Meuse datasets, it outperforms GRTS (Benedetti et al., 2017).
WAVE: Delivers $15$–$30\%$ lower MSE than GRTS/LPM, exact inclusion probabilities, and negative covariances between neighboring units; the variance estimator remains accurate (Jauslin et al., 2019).
n-Means/GMS: In broad simulations (clustered, random, and regular populations), GMS achieves the lowest Voronoi Index and the most negative Moran’s $I$ across all tested $n$, outperforming GRTS, LP1, DAS, and SRS (Panahbehagh et al., 2025).
SDSM: For MODIS data ($N=150{,}000$) with $m=2{,}000$, RMSE is $2.3$ K, competitive with FRK, SPDE/INLA, MRA, and NNGP, but with much lower memory and computational demand (Saha et al., 2023).
Mesh and DAG-based MCMC: GriPS achieves $100\times$ higher ESS/s for covariance parameters versus NNGP; SiMPA achieves $11\times$ higher ESS/s for estimating cross-correlation matrices over Hmsc. In MODIS remote sensing, SiMPA outperforms INLA and HMC in both calibration and computation (Peruzzi et al., 2021, Peruzzi et al., 2022).
TorchGeo Samplers: Enable plug-and-play integration with PyTorch DataLoader, achieving robust sampling speeds and transparent handling of CRS/reprojection (Stewart et al., 2021).
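
The two diagnostics that recur in these comparisons, a Voronoi-based spatial balance index and Moran's $I$, can be computed with a short NumPy sketch. The function names are hypothetical, and the definitions are the classical textbook ones (Voronoi cells built by nearest-sample assignment; unnormalized Moran's $I$ applied to 0/1 sample indicators), which may differ in detail from the variants used in the cited studies.

```python
import numpy as np


def voronoi_balance(coords, sample_idx, pi):
    """Mean of (v_i - 1)^2, where v_i sums the inclusion probabilities of
    all population units closest to sampled unit i. Smaller is better."""
    coords = np.asarray(coords, dtype=float)
    s = coords[np.asarray(sample_idx)]
    # Distance from every population unit to every sampled unit: (N, n).
    d = np.linalg.norm(coords[:, None, :] - s[None, :, :], axis=-1)
    nearest = d.argmin(axis=1)
    v = np.bincount(nearest, weights=pi, minlength=len(sample_idx))
    return float(np.mean((v - 1.0) ** 2))


def morans_i(z, W):
    """Classical Moran's I for values z under spatial weight matrix W.
    Negative values indicate that neighboring units tend to differ."""
    z = np.asarray(z, dtype=float)
    W = np.asarray(W, dtype=float)
    dev = z - z.mean()
    return float(len(z) / W.sum() * (dev @ W @ dev) / (dev @ dev))


# Eight units on a line, equal inclusion probabilities for n = 2.
line = np.column_stack([np.arange(8.0), np.zeros(8)])
pi = np.full(8, 0.25)
print(voronoi_balance(line, [1, 6], pi), voronoi_balance(line, [0, 1], pi))
```

A well-spread sample gives Voronoi cells whose inclusion probabilities each sum to about one, so the index is near zero, while a clumped sample leaves one cell nearly empty and another overloaded.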

5. Practical Implementation Guidelines and Limitations

Application context, parameter tuning, and computational considerations determine optimal sampler choice:

Remote Sensing: For inference/full-coverage, GridGeoSampler with pre-warped data is preferred for speed. For training deep networks on random patches, RandomGeoSampler/RandomBatchGeoSampler balance I/O and representativeness (Stewart et al., 2021).
Finite-population Designs: HPWD and WAVE are robust to clustering but assume a well-calibrated distance matrix. WAVE requires tuning the neighborhood size $r$, while HPWD’s $\gamma$ controls the magnitude of spatial repulsion (Benedetti et al., 2017, Jauslin et al., 2019).
Clustering/Intelligent n-Means: Requires careful setting of the split size $\delta$, the number of members per cluster $m$, and the number of restarts to avoid local minima. The spreadness index $S$ remains diagnostic across bandwidths and geometries (Panahbehagh et al., 2025).
Graph-based/MCMC: The choice of graph structure, jump and backtrack rates ($r$, $w$), and number of walks balances mixing efficiency against spread (Zhang, 2022).
Sequential Samplers: The pool (window) size should be modestly larger than the dimension of the auxiliary variables, so that each linear program (LP) stays small. The approach is fully streaming and therefore applies when the population size is unknown (Jauslin et al., 2021).
SDSM/Gridding: The subset size $m$ (relative to the total $N$) is critical. “Elbow” plots of RMSE versus $m$ inform the trade-off. Subsampling designs may be replaced by space-filling strategies for improved performance (Saha et al., 2023).
Mesh-based Methods: Grid/DAG construction must balance sparsity (to retain computational benefits) versus Markov blanket size (to capture spatial structure). SiMPA adaptation parameters must ensure bounded curvature and ergodicity (Peruzzi et al., 2022).

6. Extensions, Open Problems, and Future Directions

Open questions and current frontiers in geospatial sampling research include:

Adaptivity: Integration of spatially balanced or adaptive sampling designs into scalable Bayesian models, possibly using Poisson–disk/Halton sequences or stratified designs in SDSM (Saha et al., 2023).
Metaheuristic Search: Intelligent n-Means/GMS opens the door to metaheuristics (genetic algorithms, simulated annealing) for exploring the high-dimensional design search space (Panahbehagh et al., 2025).
Spatial-Temporal/Semi-supervised Sampling: Extension of balancing and spreadness criteria into the spatio-temporal domain and leveraging auxiliary variables or domain knowledge for further gains (Jauslin et al., 2021, Panahbehagh et al., 2025).
Variance Estimation and Robustness: Analytical results for variance estimators, especially under unequal inclusion probabilities and extreme clustering, as well as adaptive regularization for practical deployments (Jauslin et al., 2019).
Model-Assisted Designs: Integration of explicit spatial covariance structure into sample design optimization, unifying model-based and design-based frameworks (Panahbehagh et al., 2025).
Software and Tooling: Continued enhancement of public R (CRAN) and Python packages (BalancedSampling, meshed, TorchGeo) to support large-scale, high-dimensional geospatial applications and integration with downstream ML or statistical pipelines (Peruzzi et al., 2022, Stewart et al., 2021).

Geospatial samplers interface with several other statistical and computational paradigms:

Design-Based Survey Sampling: Samplers such as HPWD, WAVE, and n-Means extend the pivotal method, cube method, and classical stratified/tessellation designs (GRTS, SCPS) by explicitly encoding spatial structure for balance and efficiency (Benedetti et al., 2017, Jauslin et al., 2019).
Space-Filling and Clustering in Experimental Design: n-Means, clustering-based, and spreadness-based approaches are akin to space-filling LHS, maximin, or determinantal point processes but with explicit inclusion-probability management (Panahbehagh et al., 2025).
Scalable Bayesian Inference: Subsampling (SDSM), gridding (GriPS), and meshing (SiMPA) approaches permit Kriging-class modeling and spatial process inference at unprecedented scales, and can be viewed as structural approximations or latent sparsification of classic Gaussian process regression (Saha et al., 2023, Peruzzi et al., 2021, Peruzzi et al., 2022).
Remote Sensing/Deep Learning Integration: Patch-based samplers (TorchGeo) enable systematic and random extraction of spatial windows for both supervised and self-supervised learning, supporting transparent handling of variable CRS, multi-band data, and massive file sizes (Stewart et al., 2021).
Graph-based and MCMC-based Unification: Graph Spatial Sampling and mesh-based DAG approaches unify the classical Markov random field, spatial neighbor, and random walk methodologies with modern spatially balanced sampling, offering a common foundation for variance reduction and design-based estimators (Zhang, 2022, Peruzzi et al., 2022).

These convergences foster a growing ecosystem of theoretically grounded, computationally efficient samplers addressing the needs of modern spatial data science.