
Nyström KSD Estimator

Updated 23 February 2026
  • The technique projects the kernel Stein discrepancy's empirical mean embedding onto a span of Nyström landmarks, maintaining $\sqrt{n}$-consistency under sub-Gaussian assumptions.
  • It reduces computational complexity from quadratic to near-linear by evaluating the kernel on a small subset of samples.
  • Empirical benchmarks show that its statistical power closely matches full KSD tests while enabling scalable goodness-of-fit testing.

The Nyström Kernel Stein Discrepancy (Nyström-KSD) estimator is a computationally efficient technique for assessing the goodness-of-fit between a sample distribution and a target density known up to a normalization constant. It accelerates the canonical quadratic-time kernel Stein discrepancy (KSD) statistic using a low-rank projection based on Nyström landmarks. This estimator maintains $\sqrt{n}$-consistency under sub-Gaussian assumptions while dramatically reducing time and memory complexity, enabling large-scale goodness-of-fit testing in high-dimensional settings (Kalinke et al., 2024).

1. Kernel Stein Discrepancy: Foundation and Quadratic Bottleneck

Let $P$ denote a target probability distribution with density $p$ on $\mathbb{R}^d$ known up to normalization, and $Q$ a data distribution. The base kernel $k$ (e.g., Gaussian or IMQ) with RKHS $\mathcal{H}_k$ and feature map $\phi_k(x) = k(\cdot, x)$ yields the Langevin Stein operator:

$$(T_p f)(x) = \langle \nabla_x \log p(x), f(x) \rangle + \mathrm{trace}(\nabla_x f(x))$$

for $f \in \mathcal{H}_k^d$. This operator satisfies:

$$(T_p f)(x) = \langle f, \xi_p(x) \rangle_{\mathcal{H}_k^d}$$

where the Stein feature vector is

$$\xi_p(x) = \nabla_x \log p(x)\,\phi_k(x) + \nabla_x \phi_k(x) \in \mathcal{H}_k^d.$$

The induced Stein kernel is

$$h_p(x, y) = \langle \xi_p(x), \xi_p(y) \rangle_{\mathcal{H}_k^d}.$$

The population KSD is thus

$$S_p(Q) = \left\| \mathbb{E}_{X \sim Q}[\xi_p(X)] \right\|_{\mathcal{H}_k^d} = \left\| \mathbb{E}_{X \sim Q}[h_p(\cdot, X)] \right\|_{\mathcal{H}_{h_p}}.$$

As $\mathbb{E}_{X \sim P}[\xi_p(X)] = 0$, it follows that $S_p(P) = 0$.

Given samples $\{x_i\}_{i=1}^n \sim Q$, two classical estimators are used:

  • V-statistic:

$$S_p^2(\widehat{Q}_n) = \frac{1}{n^2} \sum_{i,j=1}^n h_p(x_i, x_j)$$

  • U-statistic (unbiased):

$$S_{p,u}^2(\widehat{Q}_n) = \frac{1}{n(n-1)} \sum_{i \ne j} h_p(x_i, x_j)$$

Both approaches require constructing and summing over the $n \times n$ matrix $[h_p(x_i, x_j)]$, resulting in $\mathcal{O}(n^2)$ time and memory, which is prohibitive for $n \gg 10^3$ (Kalinke et al., 2024).
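To make the bottleneck concrete, here is a minimal NumPy sketch (ours, not the authors' reference code) of both quadratic-time estimators, using the standard closed form of the Stein kernel for the Gaussian base kernel; the function names and the standard-normal example are illustrative:

```python
import numpy as np

def gaussian_stein_kernel(X, Y, score, gamma):
    """Stein kernel h_p for the Gaussian base kernel
    k(x, y) = exp(-gamma * ||x - y||^2), via the standard closed form

        h_p(x, y) = k(x, y) * ( <s(x), s(y)>
                                + 2*gamma * <s(x) - s(y), x - y>
                                + 2*gamma*d - 4*gamma**2 * ||x - y||^2 ),

    where s = grad log p is the score of the (unnormalized) target.
    X: (n, d), Y: (m, d); returns the (n, m) matrix [h_p(x_i, y_j)].
    """
    d = X.shape[1]
    Sx, Sy = score(X), score(Y)                       # (n, d), (m, d)
    diff = X[:, None, :] - Y[None, :, :]              # (n, m, d)
    sq = (diff ** 2).sum(-1)                          # ||x - y||^2
    K = np.exp(-gamma * sq)
    inner = Sx @ Sy.T                                 # <s(x), s(y)>
    cross = ((Sx[:, None, :] - Sy[None, :, :]) * diff).sum(-1)
    return K * (inner + 2 * gamma * cross + 2 * gamma * d - 4 * gamma**2 * sq)

def ksd_sq_v(H):
    """V-statistic: mean over the full n x n Stein kernel matrix."""
    return H.mean()

def ksd_sq_u(H):
    """U-statistic: off-diagonal mean (unbiased)."""
    n = H.shape[0]
    return (H.sum() - np.trace(H)) / (n * (n - 1))

# Example: standard normal target, whose score is s(x) = -x.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))
H = gaussian_stein_kernel(X, X, score=lambda x: -x, gamma=0.5)
print(ksd_sq_v(H), ksd_sq_u(H))   # both near 0 since Q = P here
```

Both statistics touch every entry of the $n \times n$ matrix `H`, which is exactly the quadratic cost the Nyström construction below avoids.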

2. Nyström-Based KSD Estimation

2.1 Nyström Landmarks and Subspace Construction

The Nyström-KSD approach projects the empirical mean embedding

$$\mu_{h_p}(\widehat{Q}_n) = \frac{1}{n} \sum_{i=1}^n h_p(\cdot, x_i)$$

onto a subspace spanned by $m \ll n$ randomly selected "Nyström points." These are $z_j = x_{i_j}$, where the indices $i_j$ are chosen with replacement from $\{1, \ldots, n\}$.

Define

$$\mathcal{H}_{h_p, m} = \mathrm{span}\{ h_p(\cdot, z_j) : 1 \leq j \leq m \} \subset \mathcal{H}_{h_p}.$$

Equivalently, in $\mathcal{H}_k^d$,

$$\mathcal{H}_{p,m} = \mathrm{span}\{\xi_p(z_j) : j = 1, \ldots, m\} \subset \mathcal{H}_k^d.$$

Projection of the empirical mean embedding onto this subspace yields the estimator.

2.2 Projection Formula and Computation

The projected mean is

$$\hat{\mu}^{\mathrm{Nys}} = P_{\mathcal{H}_{h_p,m}}\,\hat{\mu},$$

where $\hat{\mu} = \frac{1}{n} \sum_{i=1}^n h_p(\cdot, x_i)$. Its squared norm is

$$\tilde{S}_p^2(\widehat{Q}_n) = \|\hat{\mu}^{\mathrm{Nys}}\|_{\mathcal{H}_{h_p}}^2 = \beta^\top K_{mm}^{-} \beta$$

with

$$\begin{align*} K_{mm} &= [h_p(z_j, z_{j'})]_{j,j'=1}^m \in \mathbb{R}^{m \times m}, \\ K_{mn} &= [h_p(z_j, x_i)]_{j=1,\ldots,m,\; i=1,\ldots,n} \in \mathbb{R}^{m \times n}, \\ \beta &= \frac{1}{n} K_{mn} 1_n \in \mathbb{R}^m, \end{align*}$$

and $K_{mm}^{-}$ is the Moore–Penrose pseudoinverse of $K_{mm}$.

Pseudocode

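A minimal NumPy sketch of the procedure (ours, assuming a user-supplied `stein_kernel(A, B)` callable that returns the matrix $[h_p(a_i, b_j)]$, e.g. built from `gaussian_stein_kernel` above):

```python
import numpy as np

def nystrom_ksd_sq(X, stein_kernel, m, seed=None):
    """Nyström-KSD squared statistic from m landmarks."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # 1. Draw m landmark indices uniformly with replacement.
    idx = rng.integers(0, n, size=m)
    Z = X[idx]
    # 2. Small Gram blocks: O(m^2) and O(mn) kernel evaluations.
    K_mm = stein_kernel(Z, Z)            # (m, m)
    K_mn = stein_kernel(Z, X)            # (m, n)
    # 3. beta = (1/n) * K_mn @ 1_n.
    beta = K_mn.mean(axis=1)
    # 4. Squared norm of the projection: beta^T K_mm^- beta.
    return beta @ np.linalg.pinv(K_mm) @ beta
```

For instance, `nystrom_ksd_sq(X, lambda A, B: gaussian_stein_kernel(A, B, lambda x: -x, 0.5), m=int(len(X) ** 0.5))` reproduces the $m \approx \sqrt{n}$ configuration discussed below.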

The time is dominated by the $\mathcal{O}(mn)$ kernel evaluations for $K_{mn}$ and the $\mathcal{O}(m^3)$ cost of the pseudoinverse.

3. Computational and Memory Complexity

| Estimator | Time Complexity | Memory Requirement |
| --- | --- | --- |
| Quadratic-time KSD (V/U) | $\mathcal{O}(n^2)$ | $\mathcal{O}(n^2)$ |
| Nyström-KSD | $\mathcal{O}(mn + m^3)$ | $\mathcal{O}(mn + m^2)$ |

For $m = o(n^{2/3})$, Nyström-KSD achieves strictly sub-quadratic runtime and memory. When $m \approx \sqrt{n}$, the cost becomes near-linear in $n$, representing a significant acceleration in large-scale applications (Kalinke et al., 2024).
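As an illustrative back-of-the-envelope calculation (ours, not from the paper): with $n = 10^6$ and $m = \sqrt{n} = 10^3$, the Nyström cost is $mn + m^3 = 10^9 + 10^9 = 2 \times 10^9$ kernel-level operations versus $n^2 = 10^{12}$ for the full statistic, roughly a 500-fold reduction; memory drops from $10^{12}$ to about $10^9$ stored entries.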

4. Statistical Accuracy and $\sqrt{n}$-Consistency

Under Assumption A (the Stein features $\xi_p(X)$ are sub-Gaussian in $\mathcal{H}_k^d$), the Nyström-KSD estimator achieves a $\sqrt{n}$-rate:

Theorem 4.1 ($\sqrt{n}$-consistency):

Let $C_{Q,h_p} = \mathbb{E}[h_p(\cdot, X) \otimes h_p(\cdot, X)]$ with $\mathrm{Tr}(C) = \mathrm{trace}\,C_{Q,h_p}$. For regularization parameter $\lambda = c\,\mathrm{Tr}(C)/m$, $m \geq 4$, and $m \gtrsim \max\{\mathrm{Tr}(C)/\lambda, \log(1/\delta)\}$, with probability at least $1 - \delta$,

$$|S_p(Q) - \tilde{S}_p(\widehat{Q}_n)| \lesssim \frac{\mathrm{Tr}(C)\log(1/\delta)}{n} + \sqrt{\frac{\mathrm{Tr}(C)\log(1/\delta)}{n}} + \frac{\sqrt{\mathrm{Tr}(C)\log(1/\delta)}}{m}\sqrt{N(\lambda)\log(n/\delta)}$$

where $N(\lambda) = \mathrm{trace}\big(C(C + \lambda I)^{-1}\big)$ is the effective dimension.

If $N(\lambda) \lesssim \lambda^{-\gamma}$ (polynomial decay, $\gamma \in (0,1]$) or $N(\lambda) \lesssim \log(1 + O(1)/\lambda)$ (exponential decay), then setting $m \sim n^{1/(2-\gamma)}$ yields $O_p(1/\sqrt{n})$ estimation error, matching the quadratic estimator's rate at sub-quadratic cost.
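For a concrete reading (our arithmetic): $\gamma = 1/2$ gives $m \sim n^{1/(2 - 1/2)} = n^{2/3}$, so for $n = 10^6$ about $m = 10^4$ landmarks suffice, while faster spectral decay ($\gamma \to 0$, approaching the exponential-decay regime) pushes the requirement down toward $m \sim \sqrt{n}$.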

Proof Sketch:

Three contributions to the deviation are bounded: the empirical mean error ($O_p(1/\sqrt{n})$ by a Hilbert-space Bernstein inequality), the projection bias ($O(\sqrt{\lambda})$ by an operator-valued Bernstein inequality), and the subsampling error (analyzed via conditional concentration and sub-Gaussian tail bounds). Setting $\lambda \sim \mathrm{Tr}(C)/m$ balances the bias–variance tradeoff (Kalinke et al., 2024).
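To get a feel for the effective dimension $N(\lambda)$ that drives the landmark requirement, one can plug in the spectrum of the empirical Stein kernel matrix. A rough illustrative sketch (our plug-in heuristic, not a procedure from the paper, which states the theorem for the population operator):

```python
import numpy as np

def effective_dimension(H, lam):
    """Plug-in estimate of N(lambda) = trace(C (C + lambda I)^{-1}).

    Eigenvalues of the covariance operator C are approximated by the
    spectrum of H / n, where H is the n x n Stein kernel Gram matrix.
    """
    mu = np.linalg.eigvalsh(H) / H.shape[0]
    mu = np.clip(mu, 0.0, None)     # guard tiny negative eigenvalues
    return float(np.sum(mu / (mu + lam)))
```

Fast spectral decay (most eigenvalues far below $\lambda$) yields a small $N(\lambda)$ and hence fewer required landmarks.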

5. Implementation Recommendations

  • Choice of $m$ (number of Nyström landmarks): For spectra with $N(\lambda) \sim \lambda^{-\gamma}$, $m \gtrsim n^{1/(2-\gamma)}$ ensures $O(1/\sqrt{n})$ error. In empirical work, $m \approx \sqrt{n}$ or $m = 4\sqrt{n}$ yielded favorable results.
  • Kernel selection and tuning: For smooth distributions, the Gaussian kernel $k(x, y) = \exp(-\gamma\|x-y\|^2)$ with the "median heuristic" for $\gamma$ is effective. For heavy-tailed alternatives, the IMQ kernel $k(x, y) = (c^2 + \|x-y\|^2)^\beta$ with $\beta < 0$ generally increases test power. The parameters $c, \beta$ can be set using median-distance heuristics or other established routines; a sketch of both choices follows this list.
  • Algorithmic behavior: Landmarks are drawn uniformly (with replacement) from the data indices; increasing $m$ improves the approximation but incurs higher cost.
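A short sketch of these tuning choices (the helper names and the $\gamma = 1/(2\,\mathrm{median}^2)$ scaling are our assumptions; several conventions for the median heuristic exist in the literature):

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic_gamma(X):
    """Bandwidth for the Gaussian kernel exp(-gamma * ||x - y||^2),
    using the convention gamma = 1 / (2 * median^2) over pairwise
    Euclidean distances."""
    med = np.median(pdist(X))
    return 1.0 / (2.0 * med ** 2)

def imq_kernel(X, Y, c=1.0, beta=-0.5):
    """IMQ base kernel (c^2 + ||x - y||^2)^beta with beta < 0."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return (c ** 2 + sq) ** beta
```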

6. Empirical Benchmarking and Comparative Performance

Kalinke et al. (2024) compared Nyström-KSD ("N-KSD") against:

  • Quadratic-time KSD with Gaussian and IMQ kernels,
  • Finite-set Stein discrepancy (FSSD-rand, FSSD-opt; linear time),
  • Random-feature Stein discrepancies (L¹-IMQ, L²-SechExp; near-linear time).

Test settings included Laplace vs. Normal ($d = 1, \ldots, 10$, $n$ up to 1000), Student-$t(5)$ vs. Normal ($d$ up to 10), and non-normalized RBM goodness-of-fit ($n = 1000$). N-KSD (with $m \approx \sqrt{n}$) was orders of magnitude faster than full KSD and, for small $n$, the fastest among all approaches. In statistical power, N-KSD closely matched full KSD (especially with the IMQ kernel) and robustly outperformed other fast Stein discrepancy approximations in all tested dimensions (Kalinke et al., 2024).

The Nyström approximation traces to Williams and Seeger (2001) for kernel acceleration, while classical KSD tests stem from Chwialkowski et al. (2016) and Liu et al. (2016). This estimator leverages tools from operator concentration (Koltchinskii and Lounici, 2017) for statistical analysis. Its main innovation is enabling $\sqrt{n}$-consistent KSD-based inference with subquadratic complexity, bridging kernel Stein methods and scalable approximation strategies for reproducing kernel Hilbert spaces.


References:

  • Kalinke et al. (2024). "Nyström Kernel Stein Discrepancy."
  • Chwialkowski, K., Strathmann, H., & Gretton, A. (2016). "A Kernel Test of Goodness of Fit." ICML.
  • Liu, Q., Lee, J., & Jordan, M. (2016). "A Kernelized Stein Discrepancy for Goodness-of-fit Tests." ICML.
  • Williams, C. K. I., & Seeger, M. (2001). "Using the Nyström Method to Speed Up Kernel Machines." NIPS.
  • Koltchinskii, V., & Lounici, K. (2017). "Concentration Inequalities and Moment Bounds for Sample Covariance Operators." Bernoulli.
