KT-NW: Supervised Kernel Thinning

Updated 21 February 2026
  • Kernel Thinning (KT-NW) is a supervised coreset construction technique that compresses input-target pairs using a tailored meta-kernel for efficient kernel regression.
  • It employs a deterministic split-and-swap strategy to reduce the maximum mean discrepancy, achieving a quadratic improvement over naïve subsampling approaches.
  • Empirical results demonstrate that KT-NW significantly lowers computational cost while maintaining competitive mean squared error compared to Full-NW methods.

Kernel Thinning (KT-NW) is a distribution compression and coreset construction technique in the context of kernel regression. It extends the Kernel Thinning (KT) framework to supervised learning, specifically for the Nadaraya–Watson (NW) estimator, providing substantial computational and statistical improvements over naïve i.i.d. subsampling. The KT-NW methodology derives its core efficacy from the use of a supervised “meta-kernel,” tailored to jointly compress both input features and targets, enabling high-fidelity, low-cardinality coresets for regression and related tasks.

1. Theoretical Framework of Kernel Thinning

Kernel Thinning (Dwivedi et al., 2021a, 2021b) addresses the problem of choosing a size-$m$ subset (or weighted coreset) from an $n$-point sample $\{x_i\}_{i=1}^n$ such that the empirical averages of all functions in a target RKHS $\mathcal{H}_k$ are closely preserved. The worst-case integration error is measured by the Maximum Mean Discrepancy (MMD),

$$\mathrm{MMD}_k(P,Q) = \sup_{\|f\|_{\mathcal{H}_k}\leq 1}\left| \mathbb{E}_P f - \mathbb{E}_Q f \right|.$$

Standard approaches like i.i.d. random thinning or uniform subsampling typically achieve $O(n^{-1/4})$ MMD error with subsampled coresets of size $m \approx \sqrt{n}$. By contrast, KT uses a deterministic-plus-randomized split-and-swap strategy guided by a split kernel $\tilde k$, yielding $O(n^{-1/2}\sqrt{\log n})$ error (for many common kernels and distributions), representing a quadratic improvement (Dwivedi et al., 2021).
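The MMD between the full sample and a candidate coreset can be estimated with a plug-in (V-statistic) formula. A minimal sketch, using a Gaussian kernel and an i.i.d. subsample as the coreset (the kernel choice, bandwidth, and sizes are illustrative assumptions, not from the paper):

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def mmd(X, Y, kernel=gaussian_kernel):
    """Plug-in (V-statistic) estimate of MMD_k between empirical measures on X and Y."""
    kxx = kernel(X, X).mean()
    kyy = kernel(Y, Y).mean()
    kxy = kernel(X, Y).mean()
    return np.sqrt(max(kxx + kyy - 2.0 * kxy, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
coreset = X[rng.choice(400, size=20, replace=False)]  # naive i.i.d. thinning
print(mmd(X, coreset))  # small but nonzero discrepancy
```

KT's guarantees concern exactly this quantity: its coresets drive the estimate above toward zero faster than the i.i.d. subsample used here.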

The KT workflow involves recursive halving (split phase) with random assignments favoring reduced MMD, followed by a greedy swap phase to further decrease the discrepancy with respect to the target kernel kk. The algorithm is kernel-agnostic, supporting Gaussian, Matérn, inverse multiquadric, Laplace, sinc, and other kernels (Dwivedi et al., 2021).
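The split phase can be illustrated with a simplified, deterministic caricature: points are taken in pairs, and each pair is assigned across the two halves in the direction that shrinks a running "witness" of the discrepancy between the halves. Real kernel thinning uses randomized self-balancing assignments with high-probability guarantees; this sketch only conveys the balancing idea.

```python
import numpy as np

def greedy_halving(X, bandwidth=1.0):
    """Split X into two halves whose kernel mean embeddings stay close.

    Simplified caricature of KT's split phase (the real algorithm randomizes
    assignments). The witness vector tracks sum_A k(x_a, .) - sum_B k(x_b, .)
    evaluated at every sample point.
    """
    n = len(X)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2.0 * bandwidth ** 2))  # Gaussian kernel matrix
    witness = np.zeros(n)
    half_a, half_b = [], []
    for i in range(0, n - 1, 2):
        j = i + 1
        d = K[i] - K[j]
        if witness @ d <= 0:  # i -> A, j -> B keeps the halves closer
            half_a.append(i); half_b.append(j); witness += d
        else:
            half_a.append(j); half_b.append(i); witness -= d
    return X[half_a], X[half_b]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
A, B = greedy_halving(X)
print(A.shape, B.shape)  # (100, 2) (100, 2)
```

Recursing this halving log-many times and then running the swap phase yields the final size-$m$ coreset.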

2. KT-NW: Supervised Kernel Thinning for Nadaraya–Watson Regression

The KT-NW estimator (Gong et al., 2024) extends KT to the Nadaraya–Watson estimator for regression. Given i.i.d. data pairs $(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$ with $y_i = f^*(x_i) + \xi_i$, $\xi_i \sim N(0,\sigma^2)$, the conventional NW estimator is

$$\hat f_{\mathrm{NW}}(x) = \frac{\sum_{i=1}^n k(x,x_i)\,y_i}{\sum_{i=1}^n k(x,x_i)},$$

requiring $O(n)$ time per query.
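The estimator above is a one-liner in practice. A minimal 1-D sketch (the Gaussian kernel, bandwidth, and synthetic data are illustrative assumptions):

```python
import numpy as np

def nw_predict(x_query, X, y, bandwidth=0.05):
    """Nadaraya-Watson estimate: a kernel-weighted average of training targets.

    Costs O(n) kernel evaluations per query point.
    """
    w = np.exp(-((x_query - X) ** 2) / (2.0 * bandwidth ** 2))  # 1-D Gaussian kernel
    return (w * y).sum() / w.sum()

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=500)
y = np.sin(2.0 * np.pi * X) + 0.1 * rng.normal(size=500)
print(nw_predict(0.25, X, y))  # close to sin(pi/2) = 1
```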

KT-NW introduces a supervised “meta-kernel” on data pairs

$$\mathbf{k}_{\mathrm{NW}}((x,y),(x',y')) = k(x,x')(1+yy')$$

and applies KT (specifically, the Compress++ algorithm) to produce a weighted coreset $C = \{(x_{i_j}, y_{i_j})\}_{j=1}^m$ with weights $\{w_j\}$. The KT-NW estimator is then

$$\hat f_{\mathrm{KT\text{-}NW}}(x) = \frac{\sum_{j=1}^m w_j\, k(x,x_{i_j})\, y_{i_j}}{\sum_{j=1}^m w_j\, k(x,x_{i_j})},$$

reducing query-time cost to $O(m) \ll O(n)$. The meta-kernel structure enables both the numerator and denominator of NW regression to be approximated within the RKHS of $\mathbf{k}_{\mathrm{NW}}$.
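A minimal sketch of the meta-kernel and the weighted coreset estimator. The coreset here is a uniform random subsample with uniform weights, a stand-in for the actual KT/Compress++ construction; kernel, bandwidth, and data are illustrative assumptions:

```python
import numpy as np

def meta_kernel(x, y, xp, yp, bandwidth=0.1):
    """Supervised meta-kernel k_NW((x,y),(x',y')) = k(x,x') * (1 + y*y')."""
    return np.exp(-((x - xp) ** 2) / (2.0 * bandwidth ** 2)) * (1.0 + y * yp)

def ktnw_predict(x_query, Xc, yc, w, bandwidth=0.1):
    """Weighted NW estimate on a size-m coreset: O(m) kernel evaluations per query."""
    k = np.exp(-((x_query - Xc) ** 2) / (2.0 * bandwidth ** 2))
    return (w * k * yc).sum() / (w * k).sum()

rng = np.random.default_rng(2)
n = 2000
X = rng.uniform(0.0, 1.0, size=n)
y = np.sin(2.0 * np.pi * X) + 0.1 * rng.normal(size=n)
m = int(np.sqrt(n))                         # coreset size m ~ sqrt(n)
idx = rng.choice(n, size=m, replace=False)  # stand-in for the KT/Compress++ coreset
w = np.full(m, 1.0 / m)                     # placeholder uniform weights
print(meta_kernel(0.25, 1.0, 0.25, 1.0))    # -> 2.0: k(x,x) = 1, scaled by (1 + y*y')
pred = ktnw_predict(0.25, X[idx], y[idx], w)
print(pred)
```

Replacing the random subsample with a KT-selected weighted coreset is what upgrades this from the $O(n^{-1/4})$ subsampling regime to the KT-NW guarantees below.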

3. Theoretical Guarantees and Multiplicative-Error Bounds

Multiplicative-Error Approximation

On the event that KT succeeds (with probability at least $1-\delta$), the weighted coreset $C$ satisfies, for any $h$ in the RKHS,

$$\left|\frac{1}{n}\sum_{i=1}^n h(z_i) - \sum_{j=1}^m w_j h(z_{i_j})\right| \leq O\!\left(\frac{\sqrt{d}\,\log n}{m}\right)\|h\|_{\mathcal{H}(\mathbf{k})}$$

and a multiplicative error bound

$$\left| \frac1n\sum_i h(z_i) \right| \leq (1\pm\varepsilon) \left| \sum_j w_j h(z_{i_j}) \right|, \quad m \gtrsim n\varepsilon^{-2},$$

which provides strong relative-error control over the empirical means with respect to the kernels and derivatives required for kernel regression (Gong et al., 2024).

Statistical Optimality

Assuming $f^*$ and the data density $p$ are $\beta$-Hölder smooth, and with coreset size $m \approx n^{1/2}$, KT-NW achieves

$$\mathrm{MSE} = \mathbb{E}[(\hat f_{\mathrm{KT\text{-}NW}}(x)-f^*(x))^2] \leq C\, n^{-\beta/(\beta+d)} \log^2 n,$$

matching the minimax rate of Full-NW up to logarithmic factors and outperforming uniform $\sqrt{n}$-subsampling, which only attains $n^{-\beta/(2\beta + d)}$ (Gong et al., 2024).
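To make the gap between the two exponents concrete, they can be compared numerically (the choices of $n$, $\beta$, $d$ below are arbitrary illustrations):

```python
# Error rates quoted above, evaluated for illustrative n = 10^6, beta = 2, d = 8
n, beta, d = 10**6, 2.0, 8.0
ktnw_rate = n ** (-beta / (beta + d))            # KT-NW, up to log factors
subsample_rate = n ** (-beta / (2.0 * beta + d)) # uniform sqrt(n)-subsampling
print(f"KT-NW ~ {ktnw_rate:.4f}, subsampling ~ {subsample_rate:.4f}")
```

Here KT-NW's rate $10^{-1.2} \approx 0.063$ beats the subsampling rate $10^{-1} = 0.1$, and the gap widens as $n$ grows.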

4. Algorithmic Implementation and Practical Considerations

KT-NW is implemented via the Compress++ coreset construction, with an overall runtime of $O(n\log^3 n)$, storage $O(\sqrt{n})$, and $O(\sqrt{n})$ kernel evaluations per test point. This yields a quadratic speedup over Full-NW ($O(n)$ per query) and matches the inference cost of naïve subsampling while achieving a superior statistical guarantee (Gong et al., 2024).
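The per-query saving is easy to quantify: a $\sqrt{n}$-size coreset needs $\sqrt{n}$ kernel evaluations where Full-NW needs $n$ (a sketch of the scaling only; constants and preprocessing costs are ignored):

```python
# Kernel evaluations per query: Full-NW (n) vs a sqrt(n)-size coreset (m)
for n in (10**4, 10**6, 10**8):
    m = int(n ** 0.5)
    print(f"n = {n:>9}: Full-NW {n:>9} evals, coreset {m:>5} evals")
```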

Coreset size is typically $m \approx \sqrt{n}$, with bandwidth and regularization determined by cross-validation. Empirical results confirm the necessity of the supervised meta-kernel $k(x,x')(1+yy')$; alternative choices such as feature-only or concatenation-based kernels yield inferior MSE on regression benchmarks.
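Bandwidth selection by cross-validation can be sketched as a hold-out grid search over a plain NW estimator (the grid, kernel, and synthetic data are illustrative assumptions):

```python
import numpy as np

def nw_holdout_mse(bandwidth, Xtr, ytr, Xval, yval):
    """Hold-out MSE of a 1-D Gaussian-kernel NW estimator at a given bandwidth."""
    K = np.exp(-((Xval[:, None] - Xtr[None, :]) ** 2) / (2.0 * bandwidth ** 2))
    preds = (K * ytr).sum(axis=1) / K.sum(axis=1)
    return ((preds - yval) ** 2).mean()

rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, size=1000)
y = np.sin(2.0 * np.pi * X) + 0.1 * rng.normal(size=1000)
Xtr, ytr, Xval, yval = X[:800], y[:800], X[800:], y[800:]
grid = [0.01, 0.03, 0.1, 0.3]
best = min(grid, key=lambda h: nw_holdout_mse(h, Xtr, ytr, Xval, yval))
print(best)  # a small bandwidth wins on this rapidly varying target
```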

5. Empirical Performance and Benchmarking

Experiments across synthetic data ($d=1$, $f^*(x) = 8\sin(8\pi x)\exp(x)$, $\sigma=1$, Wendland kernel), real regression (California Housing, $n = 2\cdot10^4$, $d=8$, Gaussian kernel), and large-scale classification (SUSY, $N = 5\cdot10^6$, $d=18$, Laplace kernel) demonstrate:

  • KT-NW achieves MSE and test/training times nearly identical to subsampled NW (ST-NW) and close to Full-NW, with costs reduced by orders of magnitude compared to full-data inference.
  • In regression, KT-NW's MSE is within a small factor of Full-NW, and KT-NW outperforms RPCholesky thinning by a logarithmic factor in runtime (Gong et al., 2024).
  • In classification, KT-NW processes 4M samples in 1.7s on a single core, with error between that of ST-NW and RPCholesky.
  • Ablation studies show the supervised meta-kernel is statistically optimal for supervised compression.

A typical table from (Gong et al., 2024) illustrates comparative performance:

Method       MSE    Train (s)  Test (s)
Full-NW      0.414  11.11      0.70
ST-NW        0.574  0.002      0.009
RPCholesky   0.350  0.324      0.006
KT-NW        0.558  0.015      0.008

Relative to Full-NW, KT-NW cuts preprocessing from 11 s to 0.015 s and per-query time from 0.7 s to 0.008 s with a modest accuracy trade-off (MSE 0.414 vs 0.558).

6. Relation to Other Kernel Thinning Variants

KT-NW is one instantiation of a broader class of kernel-based distribution compression methods developed by Dwivedi & Mackey (Dwivedi et al., 2021a, 2021b), who introduced multiple kernel-thinning variants:

  • KT-NW (Normalized-Kernel KT): Uses a normalized kernel $k_{\mathrm{NW}}(x,y) = k(x,y)/[k(x,x)k(y,y)]^{1/2}$, ensuring all bounds have dimension-free constants.
  • Target KT: Uses the target kernel directly as the split kernel for tightest single-function error bounds.
  • Power KT: Employs a fractional power split kernel to improve MMD rates for non-smooth kernels like Laplace and Matérn.
  • KT+: Combines target and power kernels for simultaneously near-optimal single-function and MMD guarantees.

All these variants are cast in a generalized split-and-swap template, providing a unified theory and practical set of tools for kernel coreset construction.

7. Practical Guidelines and Limitations

Practical implementation recommendations include the use of median bandwidth heuristics ($\sigma \approx 1/\sqrt{2d}$), cross-validation on held-out MMD, and always carrying an i.i.d. baseline for comparison. When targeting a single function, Target KT or KT-NW is preferred; for worst-case MMD, Power KT is optimal; for both objectives, KT+ is superior.
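The median bandwidth heuristic mentioned above is commonly implemented as the median pairwise distance of (a subsample of) the data; a minimal sketch:

```python
import numpy as np

def median_bandwidth(X):
    """Median heuristic: set the kernel bandwidth to the median pairwise distance."""
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    return float(np.median(dists[np.triu_indices(len(X), k=1)]))

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
print(median_bandwidth(X))  # roughly sqrt(2*d) for standard normal inputs
```

For large $n$ the median is usually computed on a small random subsample, since the full pairwise distance matrix costs $O(n^2)$ memory.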

Complexity is $O(n^2)$ in kernel evaluations for small $m$; memory can be reduced via low-rank decompositions. The method is robust across kernel choices, with fractional-power modifications expanding the feasible kernel class. The core limitation is scalability to extremely large $n$ without further engineering.
