KT-NW: Supervised Kernel Thinning
- Supervised Kernel Thinning (KT-NW) is a coreset construction technique that compresses input–target pairs using a tailored meta-kernel for efficient kernel regression.
- It employs a deterministic split-and-swap strategy to reduce the maximum mean discrepancy, achieving a quadratic improvement over naïve subsampling approaches.
- Empirical results demonstrate that KT-NW significantly lowers computational cost while maintaining competitive mean squared error compared to Full-NW methods.
Supervised Kernel Thinning (KT-NW) is a distribution compression and coreset construction technique for kernel regression. It extends the Kernel Thinning (KT) framework to supervised learning, specifically the Nadaraya–Watson (NW) estimator, providing substantial computational and statistical improvements over naïve i.i.d. subsampling. The KT-NW methodology derives its core efficacy from a supervised “meta-kernel,” tailored to jointly compress input features and targets, enabling high-fidelity, low-cardinality coresets for regression and related tasks.
1. Theoretical Framework of Kernel Thinning
Kernel Thinning (Dwivedi et al., 2021) addresses the problem of choosing a size-$\sqrt{n}$ subset (or weighted coreset) from an $n$-point sample such that the empirical averages of all functions in a target RKHS are closely preserved. The worst-case integration error is measured by the Maximum Mean Discrepancy (MMD),
$$\mathrm{MMD}_{\mathbf{k}}(\mathbb{P}_n, \mathbb{Q}_m) \;=\; \sup_{\|f\|_{\mathcal{H}_{\mathbf{k}}} \le 1} \Big| \tfrac{1}{n}\textstyle\sum_{i=1}^{n} f(x_i) - \tfrac{1}{m}\sum_{j=1}^{m} f(\tilde{x}_j) \Big|,$$
where $\mathbb{P}_n$ and $\mathbb{Q}_m$ denote the empirical distributions of the input sample and the coreset.
Standard approaches like i.i.d. random thinning or uniform subsampling typically achieve MMD error $\Theta(n^{-1/4})$ with subsampled coresets of size $\sqrt{n}$. By contrast, KT uses a deterministic-plus-randomized split-and-swap strategy guided by a split kernel $\mathbf{k}_{\mathrm{split}}$, yielding error $O(\sqrt{\log n}/\sqrt{n})$ for many common kernels and distributions, a quadratic improvement up to logarithmic factors (Dwivedi et al., 2021).
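The MMD between the full sample and a thinned coreset can be estimated directly from empirical kernel sums. The following NumPy sketch is illustrative only (Gaussian kernel, sample sizes, and the naive subsample chosen for demonstration; not from the paper):

```python
import numpy as np

def gaussian_kernel(X, Z, bandwidth=1.0):
    """Gaussian kernel matrix k(x, z) = exp(-||x - z||^2 / (2 h^2))."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

def mmd(X, Z, bandwidth=1.0):
    """MMD between the empirical distributions of X (full sample) and Z (coreset)."""
    kxx = gaussian_kernel(X, X, bandwidth).mean()
    kzz = gaussian_kernel(Z, Z, bandwidth).mean()
    kxz = gaussian_kernel(X, Z, bandwidth).mean()
    return np.sqrt(max(kxx + kzz - 2 * kxz, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 2))
# Naive sqrt(n)-sized i.i.d. subsample -- the baseline KT improves upon.
subsample = X[rng.choice(len(X), size=32, replace=False)]
print(mmd(X, subsample))
```

Comparing this value against the MMD of a KT coreset of the same size is the standard way to verify the quadratic rate improvement empirically.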
The KT workflow involves recursive halving (split phase) with random assignments favoring reduced MMD, followed by a greedy swap phase to further decrease the discrepancy with respect to the target kernel $\mathbf{k}$. The algorithm is kernel-agnostic, supporting Gaussian, Matérn, inverse multiquadric, Laplace, sinc, and other kernels (Dwivedi et al., 2021).
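The actual KT split phase uses carefully designed randomized pair assignments; the sketch below is a simplified, deterministic, herding-style stand-in that only illustrates the halving idea: keep one point from each consecutive pair so the coreset's kernel mean tracks the full sample's. It is not the KT-SPLIT algorithm.

```python
import numpy as np

def greedy_halve(X, bandwidth=1.0):
    """One halving round: keep n/2 of n points, choosing from each consecutive
    pair the point whose inclusion keeps the coreset's kernel mean closest to
    the full sample's. Simplified, deterministic stand-in for KT's randomized
    split phase -- NOT the actual KT-SPLIT algorithm."""
    n = len(X)
    K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * bandwidth ** 2))
    full_mean = K.mean(axis=1)  # (1/n) * sum_j k(x_i, x_j) for each point i
    chosen = []
    for i in range(0, n - 1, 2):
        def surplus(c):
            # Similarity to current coreset minus similarity to full sample;
            # smaller means the point is under-represented in the coreset.
            core_sim = K[c, chosen].mean() if chosen else 0.0
            return core_sim - full_mean[c]
        chosen.append(i if surplus(i) <= surplus(i + 1) else i + 1)
    return X[chosen]

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 1))
half = greedy_halve(X, bandwidth=1.0)  # 32 of 64 points retained
```

Applying such a halving round recursively (and following it with a swap phase) is what produces KT's size-$\sqrt{n}$ coreset.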
2. KT-NW: Supervised Kernel Thinning for Nadaraya–Watson Regression
The KT-NW estimator (Gong et al., 2024) combines KT with the Nadaraya–Watson estimator for regression. Given $n$ i.i.d. data pairs $(X_i, Y_i)$ with $X_i \in \mathbb{R}^d$ and $Y_i \in \mathbb{R}$, the conventional NW estimate is
$$\hat{f}_{\mathrm{NW}}(x) \;=\; \frac{\sum_{i=1}^{n} k(x, X_i)\, Y_i}{\sum_{i=1}^{n} k(x, X_i)},$$
requiring $O(n)$ time per query.
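For concreteness, a minimal NW implementation with a Gaussian kernel (the bandwidth and synthetic data are illustrative):

```python
import numpy as np

def nw_predict(x_query, X, Y, bandwidth=1.0):
    """Nadaraya-Watson estimate: kernel-weighted average of targets.
    Costs O(n) kernel evaluations per query point."""
    w = np.exp(-((X - x_query) ** 2).sum(-1) / (2 * bandwidth ** 2))
    denom = w.sum()
    if denom == 0:  # query far from all training points
        return Y.mean()
    return (w * Y).sum() / denom

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(500, 1))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=500)
pred = nw_predict(np.array([0.5]), X, Y, bandwidth=0.2)  # roughly sin(1.5)
```

Every prediction touches all $n$ training points, which is exactly the cost KT-NW removes by thinning to a $\sqrt{n}$-point coreset.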
KT-NW introduces a supervised “meta-kernel” on data pairs,
$$k_{\mathrm{NW}}\big((x, y), (x', y')\big) \;=\; k(x, x')\,\big(1 + y\,y'\big),$$
and applies KT (specifically, the Compress++ algorithm) to produce a weighted coreset $\{(\tilde{X}_j, \tilde{Y}_j)\}_{j=1}^{m}$ of size $m = \sqrt{n}$ with weights $\{w_j\}$. The KT-NW estimator is then
$$\hat{f}_{\mathrm{KT\text{-}NW}}(x) \;=\; \frac{\sum_{j=1}^{m} w_j\, k(x, \tilde{X}_j)\, \tilde{Y}_j}{\sum_{j=1}^{m} w_j\, k(x, \tilde{X}_j)},$$
reducing query-time cost to $O(\sqrt{n})$. The meta-kernel structure enables both the numerator and denominator of NW regression to be approximated within the RKHS of $k_{\mathrm{NW}}$.
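A sketch of the meta-kernel and the coreset-based prediction step, assuming the product form $k(x,x')(1+yy')$ and treating the coreset as given (a random subset stands in here for the actual KT output):

```python
import numpy as np

def meta_kernel(x1, y1, x2, y2, bandwidth=1.0):
    """Supervised meta-kernel on (feature, target) pairs,
    k_NW((x, y), (x', y')) = k(x, x') * (1 + y * y'),
    so a single RKHS controls both the NW numerator (k(x, X_i) * Y_i terms)
    and denominator (k(x, X_i) terms)."""
    base = np.exp(-((x1 - x2) ** 2).sum(-1) / (2 * bandwidth ** 2))
    return base * (1.0 + y1 * y2)

def kt_nw_predict(x_query, X_core, Y_core, weights=None, bandwidth=1.0):
    """NW prediction restricted to a size-m coreset: O(m) = O(sqrt(n)) per query."""
    if weights is None:
        weights = np.ones(len(X_core))  # default: equal-weighted coreset
    w = weights * np.exp(-((X_core - x_query) ** 2).sum(-1) / (2 * bandwidth ** 2))
    return (w * Y_core).sum() / w.sum()

# Illustration only: a random subset stands in for the actual KT coreset.
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(400, 1))
Y = np.sin(2 * X[:, 0])
core = rng.choice(400, size=20, replace=False)
pred = kt_nw_predict(np.array([0.0]), X[core], Y[core], bandwidth=0.3)
```

In the real pipeline the 20-point subset would be produced by running Compress++ under `meta_kernel` rather than by uniform sampling.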
3. Theoretical Guarantees and Multiplicative-Error Bounds
Multiplicative-Error Approximation
On the event that KT succeeds (which holds with probability at least $1 - \delta$), the weighted coreset satisfies, for any $f$ in the RKHS of the meta-kernel,
$$\Big| \tfrac{1}{n}\textstyle\sum_{i=1}^{n} f(X_i, Y_i) \;-\; \sum_{j=1}^{m} w_j\, f(\tilde{X}_j, \tilde{Y}_j) \Big| \;\le\; \varepsilon_n\, \|f\|_{k_{\mathrm{NW}}}, \qquad \varepsilon_n = \tilde{O}(n^{-1/2}),$$
and, as a consequence, a multiplicative error bound on the NW numerator and denominator, which provides strong relative-error control over the empirical means with respect to the kernels and derivatives required for kernel regression (Gong et al., 2024).
Statistical Optimality
Assuming the regression function and the data density are $\beta$-Hölder smooth, and with coreset size $\sqrt{n}$, KT-NW achieves MSE $\tilde{O}\big(n^{-2\beta/(2\beta+d)}\big)$, matching the minimax rate of Full-NW up to logarithmic factors and outperforming uniform $\sqrt{n}$-subsampling, which only attains $\Theta\big(n^{-\beta/(2\beta+d)}\big)$ (Gong et al., 2024).
4. Algorithmic Implementation and Practical Considerations
KT-NW is implemented via the Compress++ coreset construction, with near-linear overall runtime ($\tilde{O}(n)$ kernel evaluations), $O(\sqrt{n})$ storage, and $O(\sqrt{n})$ kernel evaluations per test point. This yields a quadratic speedup over Full-NW ($O(n)$ per query) and matches the inference cost of naïve subsampling while achieving a superior statistical guarantee (Gong et al., 2024).
Coreset size is typically $\sqrt{n}$, with bandwidth and regularization determined by cross-validation. Empirical results confirm the necessity of the supervised meta-kernel $k_{\mathrm{NW}}$; alternative choices such as feature-only or concatenation-based kernels yield inferior MSE on regression benchmarks.
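The bandwidth selection mentioned above can be sketched as follows; `median_bandwidth` and `cv_bandwidth` are illustrative helper names, not from any library:

```python
import numpy as np

def median_bandwidth(X, n_pairs=2000, seed=0):
    """Median heuristic: bandwidth = median pairwise distance,
    estimated on randomly sampled pairs for speed."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X), size=n_pairs)
    j = rng.integers(0, len(X), size=n_pairs)
    d = np.sqrt(((X[i] - X[j]) ** 2).sum(-1))
    return float(np.median(d[d > 0]))

def cv_bandwidth(X, Y, candidates, n_folds=5, seed=0):
    """Pick the bandwidth minimizing K-fold cross-validated NW MSE."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)

    def fold_mse(h):
        errs = []
        for fold in folds:
            mask = np.ones(len(X), dtype=bool)
            mask[fold] = False
            Xtr, Ytr = X[mask], Y[mask]
            for q, y in zip(X[fold], Y[fold]):
                w = np.exp(-((Xtr - q) ** 2).sum(-1) / (2 * h ** 2))
                pred = (w * Ytr).sum() / max(w.sum(), 1e-12)
                errs.append((pred - y) ** 2)
        return float(np.mean(errs))

    return min(candidates, key=fold_mse)

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
Y = X[:, 0] + 0.1 * rng.normal(size=200)
h0 = median_bandwidth(X)                                   # heuristic start
best_h = cv_bandwidth(X, Y, [0.25 * h0, 0.5 * h0, h0, 2 * h0])
```

A common pattern is to seed the cross-validation grid with multiples of the median-heuristic value, as done here.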
5. Empirical Performance and Benchmarking
Experiments across synthetic regression (Wendland kernel), real regression (California Housing, Gaussian kernel), and large-scale classification (SUSY, Laplace kernel) demonstrate:
- KT-NW achieves training and test times nearly identical to subsampled NW (ST-NW) while its MSE stays close to Full-NW's, with costs reduced by orders of magnitude compared to full-data inference.
- In regression, KT-NW's MSE is within a small factor of Full-NW's, and KT-NW outperforms RPCholesky thinning in training runtime (Gong et al., 2024).
- In classification, KT-NW processes 4M samples in 1.7 s on a single core, with error between that of ST-NW and RPCholesky.
- Ablation studies show the supervised meta-kernel is the key ingredient: unsupervised alternatives degrade accuracy under compression.
A typical table from (Gong et al., 2024) illustrates comparative performance:
| Method | MSE | Train (s) | Test (s) |
|---|---|---|---|
| Full-NW | 0.414 | 11.11 | 0.70 |
| ST-NW | 0.574 | 0.002 | 0.009 |
| RPCholesky | 0.350 | 0.324 | 0.006 |
| KT-NW | 0.558 | 0.015 | 0.008 |
KT-NW compresses preprocessing from 11.11 s to 0.015 s and per-query time from 0.70 s to 0.008 s, with a modest accuracy trade-off (MSE 0.414 → 0.558).
6. Related Variants and Context in the Kernel Thinning Ecosystem
KT-NW builds on a broader class of kernel-based distribution compression methods developed by Dwivedi & Mackey (Dwivedi et al., 2021), who introduced multiple kernel-thinning variants:
- Normalized KT: Uses a normalized kernel, yielding error bounds with dimension-free constants.
- Target KT: Uses the target kernel directly as the split kernel for tightest single-function error bounds.
- Power KT: Employs a fractional power split kernel to improve MMD rates for non-smooth kernels like Laplace and Matérn.
- KT+: Combines the target and power kernels, achieving the best single-function and MMD guarantees simultaneously.
All these variants are cast in a generalized split-and-swap template, providing a unified theory and practical set of tools for kernel coreset construction.
7. Practical Guidelines and Limitations
Practical implementation recommendations include the median bandwidth heuristic (setting the bandwidth to the median pairwise distance), cross-validation on held-out MMD, and always carrying an i.i.d. baseline for comparison. When targeting a single function, Target KT is preferred; for worst-case MMD with non-smooth kernels, Power KT is optimal; when both objectives matter, KT+ is superior.
Complexity is near-linear in $n$ in kernel evaluations when Compress++ is used; memory can be reduced via low-rank decompositions. The method is robust across kernel choices, with fractional-power modifications expanding the feasible kernel class. The core limitation is scalability to extremely large $n$ without further engineering.
References
- Gong et al., “Supervised Kernel Thinning”, 2024.
- R. Dwivedi & L. Mackey, “Generalized Kernel Thinning”, 2021.
- R. Dwivedi & L. Mackey, “Kernel Thinning”, 2021.