- The paper demonstrates that the NW kernel smoothing estimator, when viewed as a Random Energy Model, requires an exponential number of samples in high dimensions.
- It shows that kernel weights act like Boltzmann weights, leading to phase transitions where either all or only a few data points dominate the estimation.
- Numerical experiments validate the asymptotic analysis, highlighting the shift from perfect interpolation to notable generalization errors.
 
 
An Analysis of Nadaraya–Watson Kernel Smoothing through the Lens of Random Energy Models
In the paper "Nadaraya–Watson kernel smoothing as a random energy model," Zavatone-Veth and Pehlevan investigate the high-dimensional behavior of the Nadaraya–Watson (NW) kernel smoothing estimator by drawing an analogy to the Random Energy Model (REM). They also explore its relationship to dense associative memories (DAMs). This work gives a statistical-physics account of why the NW estimator suffers from the curse of dimensionality and offers novel insights into its high-dimensional performance.
Introduction and Background
The NW estimator, a non-parametric regression technique, estimates a scalar function f(x) of a d-dimensional vector x given access to n examples $(x_\mu, f(x_\mu))$. The estimator is formulated as:
$$\hat{f}_{\mathcal{D}}(x) = \frac{\sum_{\mu=1}^{n} k(x, x_\mu)\, f(x_\mu)}{\sum_{\mu=1}^{n} k(x, x_\mu)}$$
where $k(x, x')$ is the kernel function. Traditional analyses indicate that the NW estimator requires a number of samples n exponentially large in the dimension d to achieve an error below a fixed tolerance, a manifestation of the curse of dimensionality. This contrasts with kernel ridge regression (KRR), which often requires only a polynomial number of samples. This performance gap raises fundamental questions about the physical mechanism behind the failure of direct smoothing in high dimensions.
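As a point of reference, here is a minimal NumPy sketch of the estimator defined above. The Gaussian kernel, bandwidth, and toy data are illustrative choices rather than the paper's setup.

```python
import numpy as np

def nw_estimate(x, X_train, y_train, kernel):
    """Nadaraya-Watson estimate at x: a kernel-weighted average of the labels."""
    w = np.array([kernel(x, xm) for xm in X_train])
    return w @ y_train / w.sum()

def rbf(x, x_prime, h=1.0):
    """Gaussian radial basis function kernel with bandwidth h."""
    return np.exp(-np.sum((x - x_prime) ** 2) / (2.0 * h ** 2))

# Usage on synthetic 1-D data: the estimate tracks the target where data is dense.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(50, 1))
y = np.sin(X[:, 0])
print(nw_estimate(np.array([0.5]), X, y, lambda a, b: rbf(a, b, h=0.3)))  # ~ sin(0.5)
```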
Mapping to the Random Energy Model
The authors propose that the NW estimator can be interpreted through the REM, where the kernel weights $k(x, x_\mu)$ are akin to Boltzmann weights in a system with quenched randomness. The high-dimensional behavior of the NW estimator, particularly its poor performance, can be related to the structure of the REM's free-energy landscape. Using the REM analogy, they demonstrate that the relevant regime for the NW estimator is exponentially many samples, $n = e^{\alpha d}$, as $d \to \infty$.
A key insight is the condensation transition in the REM, which can be directly mapped to the performance of the NW estimator. In one phase, all data points contribute relatively equally to the estimate, while in another, only a few dominate, resulting in a drastically different behavior depending on the kernel bandwidth and data distribution.
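One way to observe the two phases numerically is through the participation ratio of the normalized kernel weights, a standard diagnostic for condensation in REM-like systems. The sketch below uses a Gaussian kernel with parameter values chosen purely for illustration (none of them come from the paper): shrinking the bandwidth, which plays the role of lowering the temperature, condenses the weight onto a handful of points.

```python
import numpy as np

def participation_ratio(weights):
    """Effective number of points carrying the NW estimate.

    The normalized weights p_mu = k(x, x_mu) / sum_nu k(x, x_nu) play the role
    of Boltzmann weights; PR = 1 / sum(p_mu^2) is close to n when all points
    contribute comparably and O(1) when the weight condenses onto a few.
    """
    p = np.asarray(weights, dtype=float) / np.sum(weights)
    return 1.0 / np.sum(p ** 2)

# Toy diagnostic: shrinking the bandwidth (lowering the temperature) condenses
# the kernel weights onto the test point's nearest neighbors.
rng = np.random.default_rng(0)
n, d = 1000, 50
X = rng.standard_normal((n, d))
x = rng.standard_normal(d)
for h in [10.0, 3.0, 1.0]:
    log_k = -np.sum((X - x) ** 2, axis=1) / (2 * h ** 2)
    k = np.exp(log_k - log_k.max())   # PR is scale-invariant, so shifting the max is safe
    print(f"h={h:5.1f}  PR = {participation_ratio(k):8.1f}")
```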
Asymptotics for Spherical Data
For a concrete analysis, the paper assumes that the inputs lie on a unit sphere and uses a radial basis function kernel:
$$k(x, x_\mu) = e^{\beta \langle x, x_\mu \rangle}$$
with a single-index target function $f(x) = g(\langle w, x \rangle / d)$. The asymptotic behavior of the NW estimator is captured by evaluating overlaps using large deviation principles. They derive that the NW estimator effectively renormalizes the overlap $\rho$ between $w$ and the test point $x$:
$$\hat{f}_{\mathcal{D}}(x) \sim g(\rho\, r_*)$$
where $r_*$ is determined by a potential function derived from the REM. This renormalization governs the effective generalization ability of the NW estimator.
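To make the setup concrete, here is a minimal NumPy sketch of the spherical single-index experiment. The normalizations (placing points on the sphere of radius $\sqrt{d}$ so that overlaps $\langle w, x \rangle / d$ are order one) and all parameter values are assumptions for illustration and may differ from the paper's exact conventions; the sketch evaluates the raw NW estimate, not the limiting formula $g(\rho r_*)$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, beta = 200, 20000, 0.5       # illustrative sizes; the theory takes n = e^{alpha d}

def sphere(m, d, rng):
    """m points uniform on the sphere of radius sqrt(d) in R^d (assumed scaling)."""
    Z = rng.standard_normal((m, d))
    return np.sqrt(d) * Z / np.linalg.norm(Z, axis=1, keepdims=True)

X = sphere(n, d, rng)                      # training inputs
w = sphere(1, d, rng)[0]                   # index vector of the target
g = np.abs                                 # link function, here g(x) = |x|
y = g(X @ w / d)                           # single-index labels f(x) = g(<w, x>/d)

x_test = sphere(1, d, rng)[0]
log_k = beta * (X @ x_test)                # log kernel weights, k = exp(beta <x, x_mu>)
p = np.exp(log_k - log_k.max())            # log-sum-exp shift for numerical stability
p /= p.sum()                               # Boltzmann-like weights

print("NW estimate:       ", p @ y)
print("bare target g(rho):", g(w @ x_test / d))   # differs due to the renormalization
```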
Implications for Mean-Squared Error and Training Error
The implications for the mean-squared generalization error are derived but left partially as a conjecture due to the complexity of precise error analysis. Nonetheless, the paper delineates how the training error transitions from perfect interpolation in the retrieval phase (where the NW estimator memorizes training points) to a generalization phase where the estimator fails to perfectly interpolate, resulting in mean-squared errors comparable to those on test data.
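A self-contained variant of the spherical sketch above (same assumed normalizations) makes this crossover tangible: as the inverse temperature $\beta$ grows, the weight at a training point condenses onto that point itself and the training error collapses toward zero, while at small $\beta$ the training and test errors coincide. How sharp the crossover looks at these modest sizes is not guaranteed, since the phase transition is an asymptotic ($d \to \infty$) statement.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 200, 10000

def sphere(m, d, rng):
    """m points uniform on the sphere of radius sqrt(d) in R^d (assumed scaling)."""
    Z = rng.standard_normal((m, d))
    return np.sqrt(d) * Z / np.linalg.norm(Z, axis=1, keepdims=True)

def nw_predict(x, X, y, beta):
    """NW estimate with k(x, x_mu) = exp(beta <x, x_mu>), stabilized in log space."""
    log_k = beta * (X @ x)
    p = np.exp(log_k - log_k.max())
    return p @ y / p.sum()

X, w = sphere(n, d, rng), sphere(1, d, rng)[0]
g = np.abs                                        # link function g(x) = |x|
y = g(X @ w / d)
X_te = sphere(50, d, rng)
y_te = g(X_te @ w / d)

# Scan the inverse temperature beta. At large beta the weight at a training
# point condenses onto that point itself, so the training error collapses
# (interpolation/retrieval); at small beta, train and test errors match.
for beta in [0.01, 0.03, 0.1, 0.3, 1.0]:
    tr = np.mean([(nw_predict(X[i], X, y, beta) - y[i]) ** 2 for i in range(50)])
    te = np.mean([(nw_predict(x, X, y, beta) - yt) ** 2 for x, yt in zip(X_te, y_te)])
    print(f"beta={beta:5.2f}  train MSE={tr:.2e}  test MSE={te:.2e}")
```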
Numerical Experiments
The authors conduct numerical experiments to validate their theoretical predictions for fixed test points. For link functions such as $g(x) = |x|$ and $g(x) = \mathrm{erf}(4x)$, they illustrate that the NW estimator's behavior converges to the asymptotic analysis as the dimension grows. However, computational constraints limit the scope of these experiments to relatively small dimensions.
Future Directions
This paper opens several avenues for future research. Extending the rigorous analysis of the NW estimator to anisotropic data distributions and different kernel functions constitutes a significant area of interest. Another important direction is to establish explicit finite-size error bounds for practical applications, paralleling advancements in understanding kernel ridge regression.
Conclusion
Zavatone-Veth and Pehlevan’s analysis substantially advances our understanding of the high-dimensional behavior of the NW kernel smoothing estimator. By mapping it to the REM, they elucidate the challenges posed by high dimensionality and offer a framework for predicting and analyzing its performance. The results form a foundational step toward a deeper understanding of non-parametric regression algorithms on high-dimensional data.