Nyström KSD Estimator
- The technique leverages Nyström landmarks to project kernel Stein discrepancy estimates, maintaining √n-consistency under sub-Gaussian assumptions.
- It reduces computational complexity from quadratic to near-linear by evaluating the Stein kernel only between a small set of landmark points and the full sample.
- Empirical benchmarks show that its statistical power closely matches full KSD tests while enabling scalable goodness-of-fit testing.
The Nyström Kernel Stein Discrepancy (Nyström-KSD) estimator is a computationally efficient technique for assessing the goodness-of-fit between a sample distribution and a target density known up to a normalization constant. It accelerates the canonical quadratic-time kernel Stein discrepancy (KSD) statistic using a low-rank projection based on Nyström landmarks. This estimator maintains √n-consistency under sub-Gaussian assumptions while dramatically reducing time and memory complexity, enabling large-scale goodness-of-fit testing in high-dimensional settings (Kalinke et al., 2024).
1. Kernel Stein Discrepancy: Foundation and Quadratic Bottleneck
Let $p$ denote a target probability density on $\mathbb{R}^d$, known up to normalization, and let $q$ denote the data distribution. The base kernel $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ (e.g., Gaussian or IMQ) with RKHS $\mathcal{H}_k$ and feature map $\phi(x) = k(\cdot, x)$ yields the Langevin Stein operator
$$(\mathcal{T}_p f)(x) = \sum_{i=1}^{d} \left( \partial_{x_i} \log p(x)\, f_i(x) + \partial_{x_i} f_i(x) \right)$$
for $f = (f_1, \dots, f_d) \in \mathcal{H}_k^d$. This operator satisfies
$$(\mathcal{T}_p f)(x) = \langle f, \xi_p(x) \rangle_{\mathcal{H}_k^d},$$
where the Stein feature vector is
$$\xi_p(x) = \nabla \log p(x)\, k(\cdot, x) + \nabla_x k(\cdot, x) \in \mathcal{H}_k^d.$$
The induced Stein kernel is
$$h_p(x, y) = \langle \xi_p(x), \xi_p(y) \rangle_{\mathcal{H}_k^d}.$$
The population KSD is thus
$$\mathrm{KSD}(q, p) = \sup_{\|f\|_{\mathcal{H}_k^d} \le 1} \mathbb{E}_{x \sim q}\,(\mathcal{T}_p f)(x) = \left\| \mathbb{E}_{x \sim q}\, \xi_p(x) \right\|_{\mathcal{H}_k^d} = \sqrt{\mathbb{E}_{x, y \sim q}\, h_p(x, y)}.$$
As $\mathbb{E}_{x \sim p}\, \xi_p(x) = 0$ (the Stein identity), it follows that $\mathrm{KSD}(p, p) = 0$; under standard conditions on $k$ and $p$, $\mathrm{KSD}(q, p) = 0$ if and only if $q = p$.
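For concreteness, here is a minimal numpy sketch (illustrative, not from the paper; `stein_kernel_gauss` is a hypothetical helper name) that evaluates the Stein kernel matrix $[h_p(x_i, y_j)]$ for the Gaussian base kernel $k(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$ and a user-supplied score function $\nabla \log p$:

```python
import numpy as np

def stein_kernel_gauss(X, Y, score, sigma=1.0):
    """Langevin Stein kernel matrix [h_p(x_i, y_j)] for the Gaussian
    base kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).

    X: (n, d) and Y: (m, d) arrays; score(Z) must return grad log p
    row-wise for a (k, d) input."""
    d = X.shape[1]
    diff = X[:, None, :] - Y[None, :, :]                # (n, m, d), x - y
    sq = np.sum(diff ** 2, axis=-1)                     # ||x - y||^2
    K = np.exp(-sq / (2.0 * sigma ** 2))                # base kernel values
    Sx, Sy = score(X), score(Y)                         # scores at X and Y
    t1 = Sx @ Sy.T                                      # <s(x), s(y)>
    t2 = np.einsum('id,ijd->ij', Sx, diff) / sigma**2   # <s(x), grad_y k> / k
    t3 = -np.einsum('jd,ijd->ij', Sy, diff) / sigma**2  # <s(y), grad_x k> / k
    t4 = d / sigma**2 - sq / sigma**4                   # trace term / k
    return K * (t1 + t2 + t3 + t4)
```

For a standard normal target, `score=lambda Z: -Z` implements $\nabla \log p(x) = -x$.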
Given i.i.d. samples $X_1, \dots, X_n \sim q$, two classical estimators are used (see the sketch after this list):
- V-statistic: $\widehat{\mathrm{KSD}}_V^2 = \frac{1}{n^2} \sum_{i,j=1}^{n} h_p(X_i, X_j)$
- U-statistic (unbiased): $\widehat{\mathrm{KSD}}_U^2 = \frac{1}{n(n-1)} \sum_{i \neq j} h_p(X_i, X_j)$

Both approaches require construction and summation of the $n \times n$ matrix $[h_p(X_i, X_j)]_{i,j=1}^{n}$, resulting in $\mathcal{O}(n^2)$ time and memory, which is prohibitive for large $n$ (Kalinke et al., 2024).
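With the hypothetical helper above, both quadratic-time estimates are a few lines (again an illustrative sketch, not the authors' implementation):

```python
def ksd_quadratic(X, score, sigma=1.0):
    """Quadratic-time KSD^2 estimates; returns (V-statistic, U-statistic)."""
    n = X.shape[0]
    H = stein_kernel_gauss(X, X, score, sigma)      # O(n^2) time and memory
    v = H.sum() / n**2
    u = (H.sum() - np.trace(H)) / (n * (n - 1))     # drop diagonal: unbiased
    return v, u

# Quick check: data drawn from the target itself should give values near 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
print(ksd_quadratic(X, score=lambda Z: -Z))         # both close to 0
```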
2. Nyström-Based KSD Estimation
2.1 Nyström Landmarks and Subspace Construction
The Nyström-KSD approach projects the empirical mean embedding
$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} \xi_p(X_i)$$
onto a subspace spanned by $m$ randomly selected "Nyström points." These are $\xi_p(\tilde{X}_1), \dots, \xi_p(\tilde{X}_m)$, where the $\tilde{X}_j$ are chosen uniformly with replacement from $\{X_1, \dots, X_n\}$.
Define
$$\mathcal{H}_m = \mathrm{span}\{\xi_p(\tilde{X}_1), \dots, \xi_p(\tilde{X}_m)\} \subseteq \mathcal{H}_k^d.$$
Equivalently, in coordinates, every element of $\mathcal{H}_m$ has the form $\sum_{j=1}^{m} \alpha_j\, \xi_p(\tilde{X}_j)$ with $\alpha \in \mathbb{R}^m$.
Projection of the empirical mean embedding onto this subspace yields the estimator.
2.2 Projection Formula and Computation
The projected mean is
$$P_m \hat{\mu} = \operatorname*{arg\,min}_{g \in \mathcal{H}_m} \|\hat{\mu} - g\|_{\mathcal{H}_k^d},$$
where $P_m$ denotes the orthogonal projection onto $\mathcal{H}_m$. Its squared norm is
$$\widehat{\mathrm{KSD}}_{\mathrm{Nys}}^2 = \|P_m \hat{\mu}\|_{\mathcal{H}_k^d}^2 = \frac{1}{n^2}\, \mathbf{1}_n^\top \mathbf{K}_{nm} \mathbf{K}_{mm}^{+} \mathbf{K}_{mn} \mathbf{1}_n = \left\| \left(\mathbf{K}_{mm}^{+}\right)^{1/2} \mathbf{K}_{mn} \frac{\mathbf{1}_n}{n} \right\|_2^2,$$
with
$$\mathbf{K}_{mn} = [h_p(\tilde{X}_i, X_j)]_{i,j} \in \mathbb{R}^{m \times n}, \qquad \mathbf{K}_{mm} = [h_p(\tilde{X}_i, \tilde{X}_j)]_{i,j} \in \mathbb{R}^{m \times m},$$
and $\mathbf{K}_{mm}^{+}$ is the Moore–Penrose pseudoinverse.
Pseudocode
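A minimal numpy sketch of this computation (illustrative rather than the authors' reference implementation; it reuses the hypothetical `stein_kernel_gauss` helper from Section 1):

```python
def nystrom_ksd(X, score, m, sigma=1.0, rng=None):
    """Squared Nystrom-KSD with m landmarks drawn with replacement.

    Computes || (K_mm^+)^{1/2} K_mn 1_n / n ||_2^2 using O(nm) Stein-kernel
    evaluations plus O(m^3) linear algebra."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    idx = rng.integers(0, n, size=m)                    # landmark indices
    Xm = X[idx]
    K_mn = stein_kernel_gauss(Xm, X, score, sigma)      # (m, n), O(nm)
    K_mm = stein_kernel_gauss(Xm, Xm, score, sigma)     # (m, m)
    b = K_mn @ np.ones(n) / n                           # K_mn 1_n / n
    w, V = np.linalg.eigh(K_mm)                         # eigendecomposition, O(m^3)
    w = np.clip(w, 0.0, None)
    inv_sqrt = np.zeros_like(w)
    keep = w > 1e-10 * max(w.max(), 1e-30)              # pseudo-inverse cutoff
    inv_sqrt[keep] = 1.0 / np.sqrt(w[keep])
    alpha = (V * inv_sqrt) @ (V.T @ b)                  # (K_mm^+)^{1/2} b
    return float(alpha @ alpha)
```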
The runtime is dominated by the $\mathcal{O}(nm)$ Stein-kernel evaluations for $\mathbf{K}_{mn}$ and the $\mathcal{O}(m^3)$ eigendecomposition for the pseudo-inverse.
3. Computational and Memory Complexity
| Estimator | Time Complexity | Memory Requirement |
|---|---|---|
| Quadratic-time KSD (V/U) | $\mathcal{O}(n^2)$ | $\mathcal{O}(n^2)$ |
| Nyström-KSD | $\mathcal{O}(nm + m^3)$ | $\mathcal{O}(nm)$ |
For $m \ll n$ (with $m^3 = o(n^2)$), Nyström-KSD achieves a strictly sub-quadratic runtime and memory footprint. When $m$ grows only polylogarithmically in $n$, the cost becomes near-linear in $n$, representing a significant acceleration in large-scale applications (Kalinke et al., 2024).
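As a rough, hardware-dependent illustration of the gap (reusing the hypothetical `nystrom_ksd` sketch above):

```python
import time

rng = np.random.default_rng(1)
n, d = 20_000, 2
m = int(np.sqrt(n))                          # ~141 landmarks
X = rng.normal(size=(n, d))
score = lambda Z: -Z                         # standard normal target

t0 = time.perf_counter()
nystrom_ksd(X, score, m=m, rng=0)            # O(nm + m^3) work
print(f"Nystrom-KSD: {time.perf_counter() - t0:.2f}s")
# The full n x n Stein matrix alone would need n^2 * 8 bytes ~ 3.2 GB.
```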
4. Statistical Accuracy and √n-Consistency
Under Assumption A (the Stein features $\xi_p(X)$ are sub-Gaussian in $\mathcal{H}_k^d$), the Nyström-KSD estimator achieves a $\sqrt{n}$-rate:
Theorem 4.1 ($\sqrt{n}$-consistency):
Let $X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} q$ and draw $m \le n$ landmarks uniformly with replacement. For regularization parameter $\lambda > 0$, confidence level $\delta \in (0, 1)$, and $m$ sufficiently large relative to the effective dimension $\mathcal{N}(\lambda)$ (as quantified in Kalinke et al., 2024), with probability at least $1 - \delta$,
$$\left| \widehat{\mathrm{KSD}}_{\mathrm{Nys}} - \mathrm{KSD}(q, p) \right| \lesssim \frac{\log(1/\delta)}{\sqrt{n}} + \sqrt{\lambda},$$
where $\mathcal{N}(\lambda) = \operatorname{Tr}\!\bigl(\Sigma (\Sigma + \lambda I)^{-1}\bigr)$ and $\Sigma = \mathbb{E}_{x \sim q}[\xi_p(x) \otimes \xi_p(x)]$ is the Stein covariance operator.
If $\lambda_j(\Sigma) = \mathcal{O}(j^{-\alpha})$ (polynomial decay, $\alpha > 1$) or $\lambda_j(\Sigma) = \mathcal{O}(e^{-\beta j})$ (exponential decay, $\beta > 0$), then setting $m = \Theta(\sqrt{n} \log n)$ yields $\mathcal{O}(1/\sqrt{n})$ estimation error, matching the quadratic estimator rate at sub-quadratic cost.
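For a sense of scale, a worked instance of this landmark budget (the numbers are illustrative, not from the paper):

```latex
% Landmark budget for n = 10^4 samples under m = Theta(sqrt(n) log n):
\[
  m \;\approx\; \sqrt{10^4}\,\log\!\left(10^4\right)
    \;=\; 100 \times 9.21 \;\approx\; 921 \;\ll\; n = 10^4,
\]
% so building K_mn takes nm ~ 9.2 x 10^6 Stein-kernel evaluations instead of
% n^2 = 10^8 for the full matrix, and only a 921 x 921 system is pseudo-inverted.
```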
Proof Sketch:
Three contributions to the deviation are bounded: the empirical mean error ($\mathcal{O}(1/\sqrt{n})$ by a Hilbert-space Bernstein inequality), the projection bias (controlled via an operator-valued Bernstein inequality), and the subsampling error (analyzed via conditional concentration and sub-Gaussian tail bounds). Setting $\lambda$ to equate the latter two terms balances the bias-variance tradeoff (Kalinke et al., 2024).
5. Implementation Recommendations
- Choice of $m$ (number of Nyström landmarks): For spectra with polynomially or exponentially decaying eigenvalues, $m = \Theta(\sqrt{n} \log n)$ ensures $\mathcal{O}(1/\sqrt{n})$ error. In empirical work, $m = \sqrt{n}$ or $m = \sqrt{n} \log n$ yielded favorable results.
- Kernel selection and tuning: For smooth distributions, the Gaussian kernel with the "median heuristic" for the bandwidth $\sigma$ is effective. For heavy-tailed alternatives, the non-stationary IMQ kernel with a suitable exponent $\beta \in (-1, 0)$ generally increases test power. Parameters can be set using median-distance heuristics or other established routines (a sketch follows this list).
- Algorithmic behavior: Landmarks are drawn uniformly (with replacement) from the data indices; increasing $m$ improves the approximation but incurs higher cost.
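A minimal sketch of the two tuning ingredients above, assuming a sub-sampled median heuristic and the common stationary IMQ form $(c^2 + \|x - y\|_2^2)^{\beta}$ (function names are illustrative):

```python
import numpy as np

def median_heuristic(X, max_pairs=10_000, rng=None):
    """Gaussian bandwidth sigma set to the median pairwise distance,
    estimated on random pairs to stay sub-quadratic."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    i = rng.integers(0, n, size=max_pairs)
    j = rng.integers(0, n, size=max_pairs)
    dist = np.linalg.norm(X[i] - X[j], axis=1)
    return float(np.median(dist[dist > 0]))

def imq_kernel(X, Y, c=1.0, beta=-0.5):
    """IMQ base kernel k(x, y) = (c^2 + ||x - y||^2)^beta, beta in (-1, 0)."""
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return (c**2 + sq) ** beta
```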
6. Empirical Benchmarking and Comparative Performance
Kalinke et al. (2024) compared Nyström-KSD ("N-KSD") against:
- Quadratic-time KSD with Gaussian and IMQ kernels,
- Finite-set Stein discrepancy (FSSD-rand, FSSD-opt; linear time),
- Random-feature Stein discrepancies (L¹-IMQ, L²-SechExp; near-linear time).
Test settings included Laplace vs. Normal (sample sizes up to $1000$), Student-$t$ vs. Normal (dimensions up to $10$), and non-normalized RBM goodness-of-fit. N-KSD (with $m$ on the order of $\sqrt{n}$) was orders of magnitude faster than full KSD and, for small $m$, the fastest among all approaches. In statistical power, N-KSD closely matched full KSD (especially with the IMQ kernel) and robustly outperformed other fast Stein discrepancy approximations in all tested dimensions (Kalinke et al., 2024).
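The paper's exact calibration protocol is not reproduced here; as a generic sketch (assuming fresh samples can be drawn from the target $p$, and reusing the hypothetical `nystrom_ksd` helper above), a Monte Carlo goodness-of-fit test could look like:

```python
def nystrom_ksd_test(X, score, sample_p, m, n_sim=200, level=0.05, seed=0):
    """Reject H0: q = p when the Nystrom-KSD statistic on X exceeds the
    (1 - level) quantile of statistics computed on fresh size-n draws
    from p. sample_p(n, rng) is a user-supplied sampler for the target."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    stat = nystrom_ksd(X, score, m, rng=rng)
    null = [nystrom_ksd(sample_p(n, rng), score, m, rng=rng)
            for _ in range(n_sim)]
    return stat > np.quantile(null, 1.0 - level)
```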
7. Context and Related Work
The Nyström approximation traces to Williams and Seeger (2001) for kernel acceleration, while classical KSD tests stem from Chwialkowski et al. (2016) and Liu et al. (2016). This estimator leverages tools from operator concentration (Koltchinskii and Lounici, 2017) for statistical analysis. Its main innovation is enabling √n-consistent KSD-based inference with subquadratic complexity, bridging kernel Stein methods and scalable approximation strategies for reproducing kernel Hilbert spaces.
References:
- Kalinke et al. (2024). "Nyström Kernel Stein Discrepancy."
- Chwialkowski, K., Strathmann, H., & Gretton, A. (2016). "A Kernel Test of Goodness of Fit." ICML.
- Liu, Q., Lee, J., & Jordan, M. I. (2016). "A Kernelized Stein Discrepancy for Goodness-of-fit Tests." ICML.
- Williams, C. K. I., & Seeger, M. (2001). "Using the Nyström Method to Speed Up Kernel Machines." NIPS.
- Koltchinskii, V., & Lounici, K. (2017). "Concentration Inequalities and Moment Bounds for Sample Covariance Operators." Bernoulli.