KT-NW: Supervised Kernel Thinning

Updated 21 February 2026
  • Kernel Thinning (KT-NW) is a supervised coreset construction technique that compresses input-target pairs using a tailored meta-kernel for efficient kernel regression.
  • It employs a deterministic split-and-swap strategy to reduce the maximum mean discrepancy, achieving a quadratic improvement over naïve subsampling approaches.
  • Empirical results demonstrate that KT-NW significantly lowers computational cost while maintaining competitive mean squared error compared to Full-NW methods.

Kernel Thinning (KT-NW) is a distribution compression and coreset construction technique in the context of kernel regression. It extends the Kernel Thinning (KT) framework to supervised learning, specifically for the Nadaraya–Watson (NW) estimator, providing substantial computational and statistical improvements over naïve i.i.d. subsampling. The KT-NW methodology derives its core efficacy from the use of a supervised “meta-kernel,” tailored to jointly compress both input features and targets, enabling high-fidelity, low-cardinality coresets for regression and related tasks.

1. Theoretical Framework of Kernel Thinning

Kernel Thinning (Dwivedi et al., 2021a, 2021b) addresses the problem of choosing a size-$m$ subset (or weighted coreset) from an $n$-point sample $\{x_i\}_{i=1}^n$ such that the empirical averages of all functions in a target RKHS $\mathcal{H}_k$ are closely preserved. The worst-case integration error is measured by the Maximum Mean Discrepancy (MMD),

$$\mathrm{MMD}_k(P,Q) = \sup_{\|f\|_{\mathcal{H}_k}\leq 1}\left| \mathbb{E}_P f - \mathbb{E}_Q f \right|.$$

Standard approaches like i.i.d. random thinning or uniform subsampling typically achieve $O(n^{-1/4})$ MMD error with subsampled coresets of size $m \approx \sqrt{n}$. By contrast, KT uses a deterministic-plus-randomized split-and-swap strategy guided by a split kernel $\tilde k$, yielding $O(n^{-1/2}\sqrt{\log n})$ error (for many common kernels and distributions), representing a quadratic improvement (Dwivedi et al., 2021).
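The MMD between the full sample and a candidate coreset can be estimated with a plug-in (V-statistic) formula. A minimal sketch, using a Gaussian kernel and an i.i.d. subsample as the coreset (the kernel choice, bandwidth, and sizes are illustrative assumptions, not from the paper):

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def mmd(X, Y, kernel=gaussian_kernel):
    """Plug-in (V-statistic) estimate of MMD_k between empirical measures on X and Y."""
    kxx = kernel(X, X).mean()
    kyy = kernel(Y, Y).mean()
    kxy = kernel(X, Y).mean()
    return np.sqrt(max(kxx + kyy - 2.0 * kxy, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
coreset = X[rng.choice(400, size=20, replace=False)]  # naive i.i.d. thinning
print(mmd(X, coreset))  # small but nonzero discrepancy
```

KT's guarantees concern exactly this quantity: its coresets drive the estimate above toward zero faster than the i.i.d. subsample used here.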

The KT workflow involves recursive halving (split phase) with random assignments favoring reduced MMD, followed by a greedy swap phase to further decrease the discrepancy with respect to the target kernel kk. The algorithm is kernel-agnostic, supporting Gaussian, Matérn, inverse multiquadric, Laplace, sinc, and other kernels (Dwivedi et al., 2021).
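The split phase can be illustrated with a simplified, deterministic caricature: points are taken in pairs, and each pair is assigned across the two halves in the direction that shrinks a running "witness" of the discrepancy between the halves. Real kernel thinning uses randomized self-balancing assignments with high-probability guarantees; this sketch only conveys the balancing idea.

```python
import numpy as np

def greedy_halving(X, bandwidth=1.0):
    """Split X into two halves whose kernel mean embeddings stay close.

    Simplified caricature of KT's split phase (the real algorithm randomizes
    assignments). The witness vector tracks sum_A k(x_a, .) - sum_B k(x_b, .)
    evaluated at every sample point.
    """
    n = len(X)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2.0 * bandwidth ** 2))  # Gaussian kernel matrix
    witness = np.zeros(n)
    half_a, half_b = [], []
    for i in range(0, n - 1, 2):
        j = i + 1
        d = K[i] - K[j]
        if witness @ d <= 0:  # i -> A, j -> B keeps the halves closer
            half_a.append(i); half_b.append(j); witness += d
        else:
            half_a.append(j); half_b.append(i); witness -= d
    return X[half_a], X[half_b]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
A, B = greedy_halving(X)
print(A.shape, B.shape)  # (100, 2) (100, 2)
```

Recursing this halving log-many times and then running the swap phase yields the final size-$m$ coreset.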

2. KT-NW: Supervised Kernel Thinning for Nadaraya–Watson Regression

The KT-NW estimator (Gong et al., 2024) extends KT to the Nadaraya–Watson estimator for regression. Given i.i.d. data pairs $(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$ with $y_i = f^*(x_i) + \xi_i$, $\xi_i \sim N(0,\sigma^2)$, the conventional NW estimator is

$$\hat f_{\mathrm{NW}}(x) = \frac{\sum_{i=1}^n k(x,x_i)\,y_i}{\sum_{i=1}^n k(x,x_i)},$$

requiring $O(n)$ time per query.
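The estimator above is a one-liner in practice. A minimal 1-D sketch (the Gaussian kernel, bandwidth, and synthetic data are illustrative assumptions):

```python
import numpy as np

def nw_predict(x_query, X, y, bandwidth=0.05):
    """Nadaraya-Watson estimate: a kernel-weighted average of training targets.

    Costs O(n) kernel evaluations per query point.
    """
    w = np.exp(-((x_query - X) ** 2) / (2.0 * bandwidth ** 2))  # 1-D Gaussian kernel
    return (w * y).sum() / w.sum()

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=500)
y = np.sin(2.0 * np.pi * X) + 0.1 * rng.normal(size=500)
print(nw_predict(0.25, X, y))  # close to sin(pi/2) = 1
```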

KT-NW introduces a supervised “meta-kernel” on data pairs

$$\mathbf{k}_{\mathrm{NW}}((x,y),(x',y')) = k(x,x')(1+yy')$$

and applies KT (specifically, the Compress++ algorithm) to produce a weighted coreset $C = \{(x_{i_j}, y_{i_j})\}_{j=1}^m$ with weights $\{w_j\}$. The KT-NW estimator is then

$$\hat f_{\mathrm{KT\text{-}NW}}(x) = \frac{\sum_{j=1}^m w_j\, k(x,x_{i_j})\, y_{i_j}}{\sum_{j=1}^m w_j\, k(x,x_{i_j})},$$

reducing query-time cost to $O(m) \ll O(n)$. The meta-kernel structure enables both the numerator and denominator of NW regression to be approximated within the RKHS of $\mathbf{k}_{\mathrm{NW}}$.
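A minimal sketch of the meta-kernel and the weighted coreset estimator. The coreset here is a uniform random subsample with uniform weights, a stand-in for the actual KT/Compress++ construction; kernel, bandwidth, and data are illustrative assumptions:

```python
import numpy as np

def meta_kernel(x, y, xp, yp, bandwidth=0.1):
    """Supervised meta-kernel k_NW((x,y),(x',y')) = k(x,x') * (1 + y*y')."""
    return np.exp(-((x - xp) ** 2) / (2.0 * bandwidth ** 2)) * (1.0 + y * yp)

def ktnw_predict(x_query, Xc, yc, w, bandwidth=0.1):
    """Weighted NW estimate on a size-m coreset: O(m) kernel evaluations per query."""
    k = np.exp(-((x_query - Xc) ** 2) / (2.0 * bandwidth ** 2))
    return (w * k * yc).sum() / (w * k).sum()

rng = np.random.default_rng(2)
n = 2000
X = rng.uniform(0.0, 1.0, size=n)
y = np.sin(2.0 * np.pi * X) + 0.1 * rng.normal(size=n)
m = int(np.sqrt(n))                         # coreset size m ~ sqrt(n)
idx = rng.choice(n, size=m, replace=False)  # stand-in for the KT/Compress++ coreset
w = np.full(m, 1.0 / m)                     # placeholder uniform weights
print(meta_kernel(0.25, 1.0, 0.25, 1.0))    # -> 2.0: k(x,x) = 1, scaled by (1 + y*y')
pred = ktnw_predict(0.25, X[idx], y[idx], w)
print(pred)
```

Replacing the random subsample with a KT-selected weighted coreset is what upgrades this from the $O(n^{-1/4})$ subsampling regime to the KT-NW guarantees below.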

3. Theoretical Guarantees and Multiplicative-Error Bounds

Multiplicative-Error Approximation

On the event that KT succeeds (with probability at least $1-\delta$), the weighted coreset $C$ satisfies, for any $h$ in the RKHS,

$$\left|\frac{1}{n}\sum_{i=1}^n h(z_i) - \sum_{j=1}^m w_j h(z_{i_j})\right| \leq O\!\left(\frac{\sqrt{d}\,\log n}{m}\right)\|h\|_{\mathcal{H}(\mathbf{k})}$$

and a multiplicative error bound

$$\left| \frac1n\sum_i h(z_i) \right| \leq (1\pm\varepsilon) \left| \sum_j w_j h(z_{i_j}) \right|, \quad m \gtrsim n\varepsilon^{-2},$$

which provides strong relative-error control over the empirical means with respect to the kernels and derivatives required for kernel regression (Gong et al., 2024).

Statistical Optimality

Assuming $f^*$ and the data density $p$ are $\beta$-Hölder smooth, and with coreset size $m \approx n^{1/2}$, KT-NW achieves

$$\mathrm{MSE} = \mathbb{E}[(\hat f_{\mathrm{KT\text{-}NW}}(x)-f^*(x))^2] \leq C\, n^{-\beta/(\beta+d)} \log^2 n,$$

matching the minimax rate of Full-NW up to logarithmic factors and outperforming uniform $\sqrt{n}$-subsampling, which only attains $n^{-\beta/(2\beta + d)}$ (Gong et al., 2024).
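To make the gap between the two exponents concrete, they can be compared numerically (the choices of $n$, $\beta$, $d$ below are arbitrary illustrations):

```python
# Error rates quoted above, evaluated for illustrative n = 10^6, beta = 2, d = 8
n, beta, d = 10**6, 2.0, 8.0
ktnw_rate = n ** (-beta / (beta + d))            # KT-NW, up to log factors
subsample_rate = n ** (-beta / (2.0 * beta + d)) # uniform sqrt(n)-subsampling
print(f"KT-NW ~ {ktnw_rate:.4f}, subsampling ~ {subsample_rate:.4f}")
```

Here KT-NW's rate $10^{-1.2} \approx 0.063$ beats the subsampling rate $10^{-1} = 0.1$, and the gap widens as $n$ grows.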

4. Algorithmic Implementation and Practical Considerations

KT-NW is implemented via the Compress++ coreset construction, with an overall runtime of $O(n\log^3 n)$, storage $O(\sqrt{n})$, and $O(\sqrt{n})$ kernel evaluations per test point. This yields a quadratic speedup over Full-NW ($O(n)$ per query) and matches the inference cost of naïve subsampling while achieving a superior statistical guarantee (Gong et al., 2024).
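The per-query saving is easy to quantify: a $\sqrt{n}$-size coreset needs $\sqrt{n}$ kernel evaluations where Full-NW needs $n$ (a sketch of the scaling only; constants and preprocessing costs are ignored):

```python
# Kernel evaluations per query: Full-NW (n) vs a sqrt(n)-size coreset (m)
for n in (10**4, 10**6, 10**8):
    m = int(n ** 0.5)
    print(f"n = {n:>9}: Full-NW {n:>9} evals, coreset {m:>5} evals")
```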

Coreset size is typically $m \approx \sqrt{n}$, with bandwidth and regularization determined by cross-validation. Empirical results confirm the necessity of the supervised meta-kernel $k(x,x')(1+yy')$; alternative choices such as feature-only or concatenation-based kernels yield inferior MSE on regression benchmarks.
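Bandwidth selection by cross-validation can be sketched as a hold-out grid search over a plain NW estimator (the grid, kernel, and synthetic data are illustrative assumptions):

```python
import numpy as np

def nw_holdout_mse(bandwidth, Xtr, ytr, Xval, yval):
    """Hold-out MSE of a 1-D Gaussian-kernel NW estimator at a given bandwidth."""
    K = np.exp(-((Xval[:, None] - Xtr[None, :]) ** 2) / (2.0 * bandwidth ** 2))
    preds = (K * ytr).sum(axis=1) / K.sum(axis=1)
    return ((preds - yval) ** 2).mean()

rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, size=1000)
y = np.sin(2.0 * np.pi * X) + 0.1 * rng.normal(size=1000)
Xtr, ytr, Xval, yval = X[:800], y[:800], X[800:], y[800:]
grid = [0.01, 0.03, 0.1, 0.3]
best = min(grid, key=lambda h: nw_holdout_mse(h, Xtr, ytr, Xval, yval))
print(best)  # a small bandwidth wins on this rapidly varying target
```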

5. Empirical Performance and Benchmarking

Experiments across synthetic data ($d=1$, $f^*(x) = 8\sin(8\pi x)\exp(x)$, $\sigma=1$, Wendland kernel), real regression (California Housing, $n = 2\cdot10^4$, $d=8$, Gaussian kernel), and large-scale classification (SUSY, $N = 5\cdot10^6$, $d=18$, Laplace kernel) demonstrate:

  • KT-NW achieves MSE and test/training times nearly identical to subsampled NW (ST-NW) and close to Full-NW, with costs reduced by orders of magnitude compared to full-data inference.
  • In regression, KT-NW's MSE is within a small factor of Full-NW, and KT-NW outperforms RPCholesky thinning by a logarithmic factor in runtime (Gong et al., 2024).
  • In classification, KT-NW processes 4M samples in 1.7s on a single core, with error between that of ST-NW and RPCholesky.
  • Ablation studies show the supervised meta-kernel is statistically optimal for supervised compression.

A typical table from (Gong et al., 2024) illustrates comparative performance:

Method       MSE    Train (s)  Test (s)
Full-NW      0.414  11.11      0.70
ST-NW        0.574  0.002      0.009
RPCholesky   0.350  0.324      0.006
KT-NW        0.558  0.015      0.008

Relative to Full-NW, KT-NW cuts preprocessing from 11 s to 0.015 s and per-query time from 0.7 s to 0.008 s with a modest accuracy trade-off (MSE 0.414 vs 0.558).

6. Relation to Other Kernel Thinning Variants

KT-NW is one instantiation of a broader class of kernel-based distribution compression methods developed by Dwivedi & Mackey (Dwivedi et al., 2021a, 2021b), who introduced multiple kernel-thinning variants:

  • KT-NW (Normalized-Kernel KT): Uses a normalized kernel $k_{\mathrm{NW}}(x,y) = k(x,y)/[k(x,x)k(y,y)]^{1/2}$, ensuring all bounds have dimension-free constants.
  • Target KT: Uses the target kernel directly as the split kernel for tightest single-function error bounds.
  • Power KT: Employs a fractional power split kernel to improve MMD rates for non-smooth kernels like Laplace and Matérn.
  • KT+: Combines target and power kernels for simultaneously near-optimal single-function and MMD guarantees.

All these variants are cast in a generalized split-and-swap template, providing a unified theory and practical set of tools for kernel coreset construction.

7. Practical Guidelines and Limitations

Practical implementation recommendations include the use of median bandwidth heuristics ($\sigma \approx 1/\sqrt{2d}$), cross-validation on held-out MMD, and always carrying an i.i.d. baseline for comparison. When targeting a single function, Target KT or KT-NW is preferred; for worst-case MMD, Power KT is optimal; for both objectives, KT+ is superior.
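The median bandwidth heuristic mentioned above is commonly implemented as the median pairwise distance of (a subsample of) the data; a minimal sketch:

```python
import numpy as np

def median_bandwidth(X):
    """Median heuristic: set the kernel bandwidth to the median pairwise distance."""
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    return float(np.median(dists[np.triu_indices(len(X), k=1)]))

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
print(median_bandwidth(X))  # roughly sqrt(2*d) for standard normal inputs
```

For large $n$ the median is usually computed on a small random subsample, since the full pairwise distance matrix costs $O(n^2)$ memory.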

Complexity is $O(n^2)$ in kernel evaluations for small $m$; memory can be reduced via low-rank decompositions. The method is robust across kernel choices, with fractional-power modifications expanding the feasible kernel class. The core limitation is scalability to extremely large $n$ without further engineering.
