Adaptive k*-Nearest Neighbors

Updated 26 March 2026

k*-Nearest Neighbors is a family of adaptive nonparametric methods that determine the optimal neighborhood size per query to balance bias and variance.
Techniques such as convex optimization, Bayesian inference, geometric rules, and kernelized sparse coding are used to compute the adaptive parameter k*.
These methods enhance predictive performance by dynamically adjusting local smoothing based on data density, noise levels, and label ambiguity.

A $k^*$ -Nearest Neighbors algorithm denotes a general class of approaches in which the neighborhood size $k^*$ —previously a fixed user-chosen parameter in standard $k$ -NN—is instead selected adaptively, often per query, according to a criterion that expresses local optimality, Bayesian plausibility, or algorithmically-learned tradeoff. These techniques fundamentally address the well-known bias–variance dilemma in nonparametric learning, yielding improved control over local estimation and predictive accuracy. The literature encompasses convex optimization, Bayesian evidence maximization, kernelized sparse coding, and explicit geometric constructions for the discovery or inference of $k^*$ . This entry synthesizes major formulations and algorithmic principles for $k^*$ -NN, including both statistical and algorithmic perspectives.

1. Foundations of Adaptive Neighborhood Size

The principal motivation for $k^*$ -NN is to circumvent the global hyperparameter tuning of $k$ , which often incurs costly cross-validation or model selection cycles and is suboptimal in heterogeneous data regimes. Instead, for each query $x_0$ , a (potentially different) optimal $k^*$ is computed by optimizing a surrogate risk or following a probabilistic model for the neighborhood. Early approaches to this problem include bias–variance balancing (Anava et al., 2017), change-point detection in the local ordering around $x_0$ (Nuti, 2017), and marginal likelihood maximization in generalized Bayesian models (Kim, 2016).

The general prediction rule in $k^*$ -NN is: $\widehat{y}(x_0) = \sum_{i=1}^{n} \alpha_i^* y_i,$ where the support of the optimal weights $\alpha_i^*$ is restricted to the $k^*$ nearest neighbors of $x_0$ ; both $k^*$ and $\alpha^*$ are defined by an adaptive criterion. The result is an estimator that leverages a locally optimal level of smoothness, adapting to density, noise, or label ambiguity in the vicinity of $x_0$ (Anava et al., 2017, Jodas et al., 2022).

2. Mathematical Formulations for $k^*$ Selection

Several distinct approaches to computing $k^*$ are prominent in the literature:

Bias–Variance Convex Relaxation (Anava et al., 2017): For regression, $k^*$ arises as the maximal set of neighbors for which the combined objective

$C\Vert \alpha \Vert_2 + L\sum_{i=1}^n \alpha_i d(x_i, x_0)$

minimizes the finite-sample tradeoff under constraints $\alpha \in \Delta_n$ (the simplex). The unique optimizer is thresholded: $\alpha^*_i \propto \max(0, \lambda - \beta_i)$ , with the implicit cutoff $k^* = \max\{ i: \beta_i < \lambda \}$ , where $\beta_i$ encodes scaled distances.

PL-kNN Geometric Rule (Jodas et al., 2022): $k^*$ is the cardinality of the set of training points lying within a semicircle of radius $r = D_M(s, c^*)$ (Manhattan distance to the nearest class median centroid), oriented toward the “nearest” class. No parameter tuning is required.
Bayesian Change-Point Model (Nuti, 2017): The posterior distribution $p(k | x_{0:\tau-1})$ is computed recursively as the distribution over run-length (number of same-segment nearest neighbors) using the local likelihood and a hazard function. $k^*$ is typically chosen as the MAP (maximum a posteriori) value.
Bayesian Mutual/Symmetric $k$ -NN (Kim, 2016): $k^*$ is found by maximizing the log-evidence (marginal likelihood) $\mathcal{L}(k)$ under a Gaussian process Laplacian covariance induced by $k$ -NN graph structure. The optimal $k^*$ yields the best model fit, trading off complexity and data fit without cross-validation.
Kernelized Sparse Coding in Graph Models (Li et al., 23 Jan 2026): For each sample $x_j$ , an adaptive Lasso is solved over a composite (feature+class) kernel, yielding a sparse weight vector $w_j$ ; the number of nonzero entries $K_j = \| w_j \|_0$ is the learned neighborhood size $k^*$ for $x_j$ .

The following table summarizes these criteria:

Approach	$k^*$ Definition	Key Mechanism
Bias–Variance Convex	Threshold in convex surrogate	Stationarity/KKT
Bayesian Change-pt	MAP/posterior run-length	Change-point recursion
Bayesian Laplacian GP	MLE w.r.t. evidence	Laplacian evidence
Geometric (PL-kNN)	Count in semicircle of radius $r$	Nearest centroid
Kernel Sparse Graph	$\\|w_j\\|_0$ in adaptive Lasso	Sparse reconstruction

3. Algorithmic Realizations and Implementation

Details vary by formulation:

k*-NN Convex Optimizer (Anava et al., 2017): For each query, distances are scaled, sorted, and a threshold $\lambda^*$ is found by explicit algebraic sweep. Final weights and $k^*$ are directly computed from this value ( $O(n\log n + k^*)$ per query).
Bayesian Change-Point (Nuti, 2017): The target’s ordered neighbor sequence is processed recursively using hazard and predictive likelihoods (see on-line Bayesian change-point update equations), yielding the full $p(k|x_{0:\tau-1})$ posterior.
PL-kNN (Jodas et al., 2022): The instance-specific neighborhood is determined in geometric space (by median-centric radius and directional filtering), with no tuning cycles; all points in the eligible sector become the $k^*$ .
Kernel Sparse Graphs ( $k$ NN-Graph) (Li et al., 23 Jan 2026): Each node’s $k^*$ is discovered via kernel Lasso (with per-node regularization proportional to density), and the consensus label is precomputed over this neighborhood. Inference is routed through a hierarchical HNSW index.
Bayesian Laplacian GP (Kim, 2016): Precompute neighbor graphs at all candidate $k$ , construct Laplacian, optimize evidence w.r.t. other GP hyperparameters, then select the maximizing $k^*$ . Algorithmic complexity is $O(n^3)$ per candidate $k$ .

In practical large-scale or high-dimensional regimes, additional techniques such as bandit-based Monte Carlo selection (Bagaria et al., 2018) and distributed protocols (Fathi et al., 2020) provide adaptivity for $k^*$ while maintaining computational feasibility.

4. Bias–Variance Tradeoff and Theoretical Properties

A recurring theme in all $k^*$ -NN schemes is the explicit or implicit control of the local bias–variance tradeoff. Adjusting $k^*$ dynamically as a function of query location density, label heterogeneity, or expected risk formalizes this tradeoff:

In convex formulations (Anava et al., 2017), a larger $k^*$ reduces variance but increases bias; the optimizer selects the cutoff at each point.
PL-kNN (Jodas et al., 2022) demonstrates that $k^*$ rises naturally in high-density regions, functioning as a low-variance estimator, while shrinking in sparse regions, acting as a low-bias estimator.
Bayesian models (Nuti, 2017, Kim, 2016) control $k^*$ via implicit priors on the run-length or model complexity as manifested in the evidence criterion or the hazard function.
Empirical analyses routinely show that locally-adaptive $k^*$ achieves lower error (regression or classification) than globally-tuned $k$ , with typical error decreases of 5-15% on benchmark datasets.

No approach universally dominates: the relative merits of each formulation depend on computational budget, application (classification vs regression), and available prior information.

5. Empirical Performance and Comparative Analysis

Key evaluation results demonstrate consistent advantages of $k^*$ -NN classifiers:

k*-NN (Bias–Variance) (Anava et al., 2017): On eight UCI datasets, outperformed fixed- $k$ and kernel regressors on 7/8 benchmarks with statistically significant gains.
PL-kNN (Jodas et al., 2022): Achieved the highest F1-score on 7/11 datasets, globally outperforming both fixed- $k$ and parameterless competitors (Wilcoxon and Nemenyi tests, $\alpha=0.05$ ).
Bayesian mutual/symmetric $k$ -NN (Kim, 2016): Evidence-based $k^*$ selection often chooses $k^*$ very different from LOOCV, yielding substantially lower test error (up to $50\%$ reduction in artificial datasets).
kNN-Graph (Li et al., 23 Jan 2026): Adaptive sparsification and per-node $k^*$ in HNSW structure yields classification accuracy surpassing all fixed- $k$ and “one-size-fits-all” competitors, e.g., mean accuracy of 73.76% versus 72.22% (O $k$ NN baseline), alongside order-of-magnitude speedups.
Bandit-Based Approaches (Bagaria et al., 2018): In extreme high-dimensional settings (e.g., Tiny ImageNet: $d > 10{,}000$ ), coordinate-wise adaptive nearest neighbor identification achieves 80× reduction in total coordinate computations relative to exact brute-force, matching or exceeding recall of approximate graph-based heuristics.

6. Algorithmic and Computational Considerations

While $k^*$ -NN methods provide compelling statistical guarantees, some practical tradeoffs are significant:

Per-query computational complexity can be higher than fixed- $k$ if each new instance requires custom optimization, e.g., full O( $n^3$ ) Laplacian inversions in Bayesian GP models (Kim, 2016).
PL-kNN and convex $k^*$ -NN (bias–variance) method costs are dominated by distance calculations and sorting, remaining linearithmic in $n$ per query (Jodas et al., 2022, Anava et al., 2017).
Methods employing offline adaptive neighbor selection (e.g., kNN-Graph with HNSW (Li et al., 23 Jan 2026)) can transfer all expensive neighbor computations to the training phase, achieving $O(\log n)$ inference time at query—unlike all classical local $k^*$ -NN schemes.

Hyperparameter burden is generally modest: often a single ratio or density-scale in bias–variance or kernel-sparse models, while PL-kNN and Bayesian algorithms have negligible or data-driven parameter dependence.

7. Extensions, Limitations, and Future Directions

The versatility of $k^*$ -NN methods extends to:

Graph and manifold domains: via multi-source Dijkstra (Har-Peled, 2016) or dynamic planar data structures (Berg et al., 2021).
Distributed and parallel computation: randomized pruning and selection protocols allow efficient $k^*$ -NN in distributed settings, with round-optimal algorithms in the communication-centric $k$ -machine model (Fathi et al., 2020).
Large-scale, high-dimensional, and dynamic data: coordinate sampling with stochastic filtering (Bagaria et al., 2018), as well as fully dynamic data structures (Berg et al., 2021), offer scalable $k^*$ -nearest neighbor retrieval.

Limitations typically arise in computational cost for exhaustive per-query optimization and the scalability of evidence maximization, as well as the requirement for density estimation or kernel selection in some frameworks (Li et al., 23 Jan 2026). In highly dynamic scenarios, adaptive graph or kernel representations may require costly retraining.

The $k^*$ -NN paradigm continues to influence the design of nonparametric estimators, guiding both the construction of new adaptive, parameterless algorithms and the analysis of sample complexity and generalization in locally weighted inference (Anava et al., 2017, Jodas et al., 2022, Li et al., 23 Jan 2026).