Two-Scale Distributional Nearest-Neighbor Regression
- The paper introduces a novel two-scale framework that combines estimators from coarse and fine scales to cancel bias and control variance for distributional regression.
- The methodology leverages adaptive multiscale weighting and kernel embedding techniques to manage two-stage sampling and ensure minimax-optimal convergence rates.
- The approach supports rigorous uncertainty quantification through CLT, bootstrap, and jackknife methods, providing reliable inference on complex, distributional inputs.
Two-Scale Distributional Nearest-Neighbor Regression is a paradigm that leverages multilevel nearest-neighbor relationships to model and estimate responses from distributional inputs, capturing both local and global adaptation. This approach unifies advances from classical nonparametric regression, modern kernel embedding theory, multi-scale weighting, and distributional statistics in a comprehensive framework that enables minimax-optimal inference, adaptive estimation, and rigorous uncertainty quantification in complex data regimes.
1. Theoretical Foundations: Nonparametric Rates and Multi-Scale Bias–Variance Tradeoff
The central statistical insight underlying Two-Scale Distributional Nearest-Neighbor Regression is the control of bias and variance via multi-level sampling strategies. In classical regression with $d$-dimensional covariates and an $s$-smooth (Hölder) regression function $m$, the minimax rate over $n$ samples is of order $n^{-2s/(2s+d)}$ in squared error (Ayano, 2011). The k-NN estimator attains this rate for low-order smoothness (up to $s = 2$ under suitable regularity) by taking $k \asymp n^{2s/(2s+d)}$, balancing the neighborhood size so that the bias and variance contributions are of the same order.
When generalizing to Two-Scale methods, one estimator is computed at a "coarse" scale (large $k$ or neighborhood radius) and another at a "fine" scale (small $k$). By linearly combining the estimators from both scales, often with explicit weights that can be negative, the approach cancels the leading-order bias terms present in either single-scale estimator (Demirkaya et al., 2018). The two-scale estimator typically takes the form $\tilde{D}(x) = w_1 \hat{D}_{k_1}(x) + w_2 \hat{D}_{k_2}(x)$ with $w_1 + w_2 = 1$, the weights being chosen to annihilate the leading terms of the bias expansion (e.g., so that fourth-order rather than second-order smoothness can be exploited).
This multi-scale bias cancellation, alongside variance control afforded by U-statistic theory and the Hajek projection, enables the two-scale estimator to attain the optimal nonparametric convergence rate under higher-order smoothness assumptions, together with asymptotic normality and valid confidence intervals constructed via jackknife and bootstrap variance estimation (Demirkaya et al., 2018).
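To make the scale combination concrete, the sketch below combines two ordinary k-NN fits at a fine scale $k_1$ and a coarse scale $k_2$, with weights derived under the assumption that the leading bias is proportional to $(k/n)^{2/d}$. It is a simplified stand-in for the subsampling-based construction of Demirkaya et al. (2018), not that estimator itself; the function name and all constants are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def two_scale_knn_predict(X, y, x_query, k1, k2):
    """Combine a fine-scale (k1) and a coarse-scale (k2) k-NN fit so that
    the leading bias terms, assumed proportional to (k/n)^(2/d), cancel."""
    n, d = X.shape
    fine = KNeighborsRegressor(n_neighbors=k1).fit(X, y).predict(x_query)
    coarse = KNeighborsRegressor(n_neighbors=k2).fit(X, y).predict(x_query)
    r1, r2 = (k1 / n) ** (2 / d), (k2 / n) ** (2 / d)
    w1 = r2 / (r2 - r1)              # > 1; the coarse scale receives weight 1 - w1 < 0
    return w1 * fine + (1 - w1) * coarse

# toy usage: smooth 2-d regression surface with additive noise
rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 2))
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(2000)
x0 = np.array([[0.5, 0.5]])
print(two_scale_knn_predict(X, y, x0, k1=20, k2=80))
```

Because $r_1 < r_2$, the fine scale receives a weight larger than one and the coarse scale a negative weight, matching the sign pattern noted above.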
2. Distributional Inputs: Regression on Probability Measures
In distributional regression, the input "covariate" is itself a probability distribution $P$, observed only through a finite sample drawn from it. A major analytical challenge is the two-stage sampling process: first, distributions are drawn from a meta-distribution, and second, each distribution is observed only through a finite bag of draws (Szabo et al., 2014, Szabo et al., 2014). The regression map takes the general form $f : \mathcal{P}(\mathcal{X}) \to \mathbb{R}$, with risk $\mathcal{R}(f) = \mathbb{E}\,[(f(P) - Y)^2]$.
Two-Scale Nearest-Neighbor methods in this context frequently use metric or similarity measures between probability distributions, such as Wasserstein, Maximum Mean Discrepancy (MMD), and set kernels. The estimator may combine similarities computed at coarse and fine scales—either by varying subsample sizes used to estimate the distributional embedding (mean, covariance, higher moments), or by using different radii or bandwidths for kernel or nearest-neighbor search.
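As an example of one such similarity usable at either scale, the snippet below computes the standard unbiased estimate of the squared MMD between two observed bags with a Gaussian kernel; the bandwidth `sigma` is an illustrative choice rather than one prescribed by the cited works.

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def mmd2_unbiased(bag_x, bag_y, sigma=1.0):
    """Unbiased estimate of MMD^2 between the distributions generating two bags."""
    Kxx = gaussian_gram(bag_x, bag_x, sigma)
    Kyy = gaussian_gram(bag_y, bag_y, sigma)
    Kxy = gaussian_gram(bag_x, bag_y, sigma)
    m, n = len(bag_x), len(bag_y)
    np.fill_diagonal(Kxx, 0.0)   # drop diagonal terms for the unbiased estimator
    np.fill_diagonal(Kyy, 0.0)
    return Kxx.sum() / (m * (m - 1)) + Kyy.sum() / (n * (n - 1)) - 2 * Kxy.mean()
```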
Embedding distributions into a reproducing kernel Hilbert space (RKHS), followed by regularized regression (kernel ridge regression), yields consistency guarantees in the two-stage sampled setting, matching the one-stage minimax rate and resolving a longstanding open question for set kernels (Szabo et al., 2014, Szabo et al., 2014).
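A minimal sketch of this two-stage pipeline, under the assumption that a random Fourier feature map is an acceptable finite-dimensional surrogate for the RKHS mean embedding: each bag is reduced to an empirical mean embedding, and a ridge regression is fit on the embedded bags. The feature dimension, bandwidth, toy data, and regularization are placeholders, not the choices analyzed by Szabo et al. (2014).

```python
import numpy as np
from sklearn.linear_model import Ridge

def mean_embedding(bag, W, b):
    """Empirical mean embedding of a bag via random Fourier features
    approximating a Gaussian kernel: phi(x) = sqrt(2/D) * cos(W x + b)."""
    feats = np.sqrt(2.0 / W.shape[0]) * np.cos(bag @ W.T + b)
    return feats.mean(axis=0)

rng = np.random.default_rng(1)
D, d = 256, 3                        # number of random features, input dimension
W = rng.standard_normal((D, d))      # frequencies for a unit-bandwidth Gaussian kernel
b = rng.uniform(0, 2 * np.pi, D)

# two-stage sampling: each "covariate" is a bag of draws from its own distribution
bags = [rng.normal(loc=mu, scale=1.0, size=(50, d)) for mu in rng.uniform(-2, 2, size=20)]
y = np.array([bag.mean() for bag in bags])          # toy label: overall mean of the bag

Z = np.vstack([mean_embedding(bag, W, b) for bag in bags])
model = Ridge(alpha=1e-2).fit(Z, y)                 # regularized regression on embeddings
```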
3. Multi-Scale Weighting and Adaptive Algorithms
A defining methodological feature is the use of adaptively generated weights across multiple scales. In the context of distributional nearest neighbor regression, bagging and weighted nearest neighbors yield estimators with weights determined by subsampling or local geometry. The two-scale approach linearly combines two single-scale estimators, sometimes with negative weights, to cancel higher-order bias (Demirkaya et al., 2018). Explicit formulas for weight computations are derived:
- For subsampling scales $s_1$ and $s_2$, the weight assigned to each neighbor is constructed so that the total bias (obtained from a Taylor expansion of the regression function around the query point) vanishes up to the prescribed order.
Extension to adaptive schemes can be achieved by varying $k$ or the neighborhood width spatially, according to local data density (as in adaptive k-NN) (Zhao et al., 2019, Zamolodtchikov et al., 21 Jan 2024). This results in estimators that choose the neighborhood size based on the local sample density, with detailed risk bounds showing minimax optimality under density-tail and margin conditions, even in heavy-tailed or unbounded-support regimes.
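A sketch of spatially adaptive neighborhood choice is given below, assuming a pilot nearest-neighbor distance as the local-density proxy and a heuristic power-law rule for $k(x)$; the cited works derive calibrated rules with explicit constants, which this illustration does not reproduce.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adaptive_knn_predict(X, y, x_query, m_pilot=25, k_min=5, k_max=200, c=1.0):
    """k-NN prediction with a spatially varying k: the distance to a pilot
    m_pilot-th neighbor serves as a local-density proxy, and k grows with the
    estimated density.  The scaling rule is a heuristic for illustration only."""
    n, d = X.shape
    k_cap = min(k_max, n)
    nn = NearestNeighbors(n_neighbors=k_cap).fit(X)
    dist, idx = nn.kneighbors(np.atleast_2d(x_query))
    preds = []
    for dist_i, idx_i in zip(dist, idx):
        r_pilot = max(dist_i[min(m_pilot, k_cap) - 1], 1e-12)   # pilot-NN radius
        density = m_pilot / (n * r_pilot ** d)                  # crude local density estimate
        k = int(np.clip(c * (n * density) ** (4 / (4 + d)), k_min, k_cap))
        preds.append(y[idx_i[:k]].mean())
    return np.array(preds)
```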
4. Statistical Inference: Asymptotic Normality, Jackknife, and Bootstrap
Statistical inference in Two-Scale Distributional Nearest-Neighbor Regression rests on U-statistic theory and the Hoeffding decomposition. The estimator can be written as a (possibly infinite-order) U-statistic, whose leading-order term (the Hajek projection) captures the dominant stochastic variability (Demirkaya et al., 2018). Under mild conditions, the remaining higher-order terms in the decomposition are negligible, so the leading term dominates and central limit theorem (CLT) results follow.
Variance estimation and construction of valid confidence intervals are made possible via explicit formulas provided by the Hajek projection and are supported by bootstrapping and jackknife methods. The framework effectively linearizes the estimator to a sum of nearly i.i.d. terms, for which standard bootstrap validity holds.
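Because the linearization reduces the estimator to an approximate sum of i.i.d. terms, a generic resampling scheme can be sketched as follows. This is a plain percentile bootstrap over training pairs, not the specific jackknife or bootstrap construction analyzed by Demirkaya et al. (2018).

```python
import numpy as np

def bootstrap_ci(estimator, X, y, x_query, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for estimator(X, y, x_query),
    where estimator returns a scalar prediction at the query point."""
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)       # resample training pairs with replacement
        stats[b] = estimator(X[idx], y[idx], x_query)
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# usage with the two-scale predictor sketched earlier (assumed in scope):
# lo, hi = bootstrap_ci(lambda Xb, yb, xq: two_scale_knn_predict(Xb, yb, xq, 20, 80)[0],
#                       X, y, x0)
```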
5. Unified View: From Distributional Empirical Processes to Wasserstein Regression
A coherent understanding emerges from advances in empirical process theory, kernel mean embedding, and optimal transport. The k-NN empirical measure
$$\hat{P}_k(x) = \frac{1}{k} \sum_{i=1}^{k} \delta_{Y_{(i)}(x)},$$
built from the responses of the $k$ nearest neighbors of $x$, is central to estimating distributional quantities such as the conditional cumulative distribution function, conditional quantiles, or the full conditional law (Portier, 2021). For conditional distribution estimation, the $L^2$ risk (and CRPS-based scoring) is minimized at the Bayes-optimal distribution, and both k-NN and kernel regression estimators attain the minimax rate under Hölder regularity (Pic et al., 2022).
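A brief sketch of these two ingredients, assuming equal neighbor weights: the conditional law at a query point is estimated by the empirical measure of the $k$ nearest responses, and the fit is scored with the CRPS via its standard expectation identity.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_conditional_sample(X, y, x_query, k):
    """Responses of the k nearest neighbors of x_query: an equally weighted
    empirical estimate of the conditional law of Y given X = x_query."""
    idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(
        np.atleast_2d(x_query), return_distance=False)
    return y[idx[0]]

def crps_empirical(sample, y_obs):
    """CRPS of the empirical distribution of `sample` against observation y_obs,
    via the identity CRPS = E|Z - y| - 0.5 E|Z - Z'| for Z, Z' ~ F independent."""
    sample = np.asarray(sample, dtype=float)
    term1 = np.abs(sample - y_obs).mean()
    term2 = 0.5 * np.abs(sample[:, None] - sample[None, :]).mean()
    return term1 - term2
```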
When regression curves are sought between distributions (e.g., in time series of distributions or images), Wasserstein regression recasts the problem as a multi-marginal optimal transport (MMOT) task, with regression curves fitted directly in distribution space, generalizing nearest-neighbor and least-squares regression to measure-valued data (Karimi et al., 2021).
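In one dimension this simplifies considerably, since the $W_2$ geometry reduces to the $L^2$ geometry of quantile functions; the sketch below fits a linear-in-$t$ regression of quantile curves on a scalar index, a deliberately simplified stand-in for the multi-marginal optimal transport formulation of Karimi et al. (2021). Monotonicity of the predicted quantile curves is not enforced here.

```python
import numpy as np

def quantile_curve(sample, grid):
    """Empirical quantile function of a 1-d sample, evaluated at levels in `grid`."""
    return np.quantile(sample, grid)

def fit_wasserstein_line(samples, t, n_levels=99):
    """Least-squares regression of 1-d distributions on a scalar index t: each
    distribution is represented by its quantile function, and a separate linear
    fit in t is made at every quantile level (valid in 1-d, where W2 distances
    are L2 distances between quantile functions)."""
    t = np.asarray(t, dtype=float)
    grid = np.linspace(0.5 / n_levels, 1 - 0.5 / n_levels, n_levels)
    Q = np.vstack([quantile_curve(s, grid) for s in samples])  # (n_distributions, n_levels)
    A = np.column_stack([np.ones_like(t), t])                  # design: intercept + slope in t
    coef, *_ = np.linalg.lstsq(A, Q, rcond=None)               # coef[0]: intercept curve, coef[1]: slope curve
    return grid, coef
```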
6. Algorithmic and Computational Considerations
Two-Scale Distributional Nearest-Neighbor Regression introduces computational complexity due to multi-scale search and combination. Approximate nearest neighbor search (locality-sensitive hashing, random forests, cover trees, boundary forests) supports scalability to large datasets (Chen et al., 21 Feb 2025). The estimation process involves:
- Multi-scale neighborhood search (adaptive $k$, variable radii)
- Weighted aggregation (some weights negative for bias cancellation)
- Density estimation when necessary, especially in transfer learning or covariate shift (Zamolodtchikov et al., 21 Jan 2024)
- Embedding computation (mean, covariance, kernel features) for distribution inputs
Selection of algorithm parameters (number of neighbors, neighborhood radii, weighting schemes) is typically guided by explicit formulas guaranteeing desired error thresholds, often expressed in terms of sample size, dimension, smoothness, and local data density (Chen et al., 21 Feb 2025). For distributional inputs, selection of metrics (Wasserstein, MMD, set kernels) and regularization is dictated by the underlying geometry.
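As an illustration of how such choices might be wired together, the sketch below picks coarse and fine neighborhood sizes from the sample size and dimension using the standard nonparametric order (the constants are placeholders that would normally be tuned, e.g., by cross-validation) and answers both scales from a single tree query.

```python
import numpy as np
from sklearn.neighbors import BallTree

def choose_scales(n, d, ratio=0.25, c=1.0):
    """Illustrative rule of thumb: a coarse scale of the standard nonparametric
    order n^(4/(4+d)) and a fine scale that is a fixed fraction of it."""
    k2 = max(2, int(c * n ** (4 / (4 + d))))
    k1 = max(1, int(ratio * k2))
    return k1, k2

def two_scale_query(X, y, x_query, k1, k2):
    """One BallTree query at the coarse scale serves both scales, since the
    k1 nearest neighbors are a prefix of the k2 nearest neighbors."""
    k2 = min(k2, len(X))
    k1 = min(k1, k2)
    tree = BallTree(X)
    idx = tree.query(np.atleast_2d(x_query), k=k2, return_distance=False)
    fine = y[idx[:, :k1]].mean(axis=1)     # fine-scale local average
    coarse = y[idx].mean(axis=1)           # coarse-scale local average
    return fine, coarse
```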
7. Applications and Frontier Directions
Applications span regression on sets, images, time-series distributions, and multi-instance learning (Szabo et al., 2014, Szabo et al., 2014, Karimi et al., 2021, Pic et al., 2022). Practical implications include minimax-optimal postprocessing in meteorological forecasting, multi-scale confidence interval construction, adaptive transfer learning under covariate shift, and selective distributional regression with reject options (where abstention is based on estimated uncertainty via the entropy of CRPS (Zaoui et al., 31 Mar 2025)). The two-scale paradigm serves as a design principle for new robust, scalable regression algorithms that adaptively blend global structure and local adaptation to achieve optimal rates and rigorous inference.
In summary, Two-Scale Distributional Nearest-Neighbor Regression synthesizes optimal rate theory, bias-variance tradeoff via multi-scale combination, empirical process methodology, kernel and transport-based distributional statistics, and scalable nearest-neighbor search to provide a principled foundation for regression on complex, distributional data with minimax risk guarantees, valid uncertainty quantification, and adaptive capacity for challenging data geometries and sampling regimes.