
Nyström KSD Estimator

Updated 23 February 2026
  • The technique projects the kernel Stein discrepancy's empirical mean embedding onto a span of Nyström landmarks, maintaining $\sqrt{n}$-consistency under sub-Gaussian assumptions.
  • It reduces computational complexity from quadratic to near-linear by evaluating the kernel on a small subset of samples.
  • Empirical benchmarks show that its statistical power closely matches full KSD tests while enabling scalable goodness-of-fit testing.

The Nyström Kernel Stein Discrepancy (Nyström-KSD) estimator is a computationally efficient technique for assessing the goodness-of-fit between a sample distribution and a target density known up to a normalization constant. It accelerates the canonical quadratic-time kernel Stein discrepancy (KSD) statistic using a low-rank projection based on Nyström landmarks. This estimator maintains $\sqrt{n}$-consistency under sub-Gaussian assumptions while dramatically reducing time and memory complexity, enabling large-scale goodness-of-fit testing in high-dimensional settings (Kalinke et al., 2024).

1. Kernel Stein Discrepancy: Foundation and Quadratic Bottleneck

Let $P$ denote a target probability distribution with density $p$ on $\mathbb{R}^d$ known up to normalization, and $Q$ a data distribution. The base kernel $k$ (e.g., Gaussian or IMQ) with RKHS $\mathcal{H}_k$ and feature map $\phi_k(x) = k(\cdot, x)$ yields the Langevin Stein operator:

$$(T_p f)(x) = \langle \nabla_x \log p(x), f(x) \rangle + \mathrm{trace}(\nabla_x f(x))$$

for $f \in \mathcal{H}_k^d$. This operator satisfies:

$$(T_p f)(x) = \langle f, \xi_p(x) \rangle_{\mathcal{H}_k^d}$$

where the Stein feature vector is

$$\xi_p(x) = \nabla_x \log p(x)\,\phi_k(x) + \nabla_x \phi_k(x) \in \mathcal{H}_k^d.$$

The induced Stein kernel is

$$h_p(x, y) = \langle \xi_p(x), \xi_p(y) \rangle_{\mathcal{H}_k^d}.$$

The population KSD is thus

$$S_p(Q) = \left\| \mathbb{E}_{X \sim Q}[\xi_p(X)] \right\|_{\mathcal{H}_k^d} = \left\| \mathbb{E}_{X \sim Q}[h_p(\cdot, X)] \right\|_{\mathcal{H}_{h_p}}.$$

As $\mathbb{E}_{X \sim P}[\xi_p(X)] = 0$, it follows that $S_p(P) = 0$.

Given samples $\{x_i\}_{i=1}^n \sim Q$, two classical estimators are used:

  • V-statistic:

$$S_p^2(\widehat{Q}_n) = \frac{1}{n^2} \sum_{i,j=1}^n h_p(x_i, x_j)$$

  • U-statistic (unbiased):

$$S_{p,u}^2(\widehat{Q}_n) = \frac{1}{n(n-1)} \sum_{i \ne j} h_p(x_i, x_j)$$

Both approaches require constructing and summing over the $n \times n$ matrix $[h_p(x_i, x_j)]$, resulting in $\mathcal{O}(n^2)$ time and memory, which is prohibitive for $n \gg 10^3$ (Kalinke et al., 2024).
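To make the bottleneck concrete, here is a minimal NumPy sketch (ours, not the authors' reference code) of both quadratic-time estimators, using the standard closed form of the Stein kernel for the Gaussian base kernel; the function names and the standard-normal example are illustrative:

```python
import numpy as np

def gaussian_stein_kernel(X, Y, score, gamma):
    """Stein kernel h_p for the Gaussian base kernel
    k(x, y) = exp(-gamma * ||x - y||^2), via the standard closed form

        h_p(x, y) = k(x, y) * ( <s(x), s(y)>
                                + 2*gamma * <s(x) - s(y), x - y>
                                + 2*gamma*d - 4*gamma**2 * ||x - y||^2 ),

    where s = grad log p is the score of the (unnormalized) target.
    X: (n, d), Y: (m, d); returns the (n, m) matrix [h_p(x_i, y_j)].
    """
    d = X.shape[1]
    Sx, Sy = score(X), score(Y)                       # (n, d), (m, d)
    diff = X[:, None, :] - Y[None, :, :]              # (n, m, d)
    sq = (diff ** 2).sum(-1)                          # ||x - y||^2
    K = np.exp(-gamma * sq)
    inner = Sx @ Sy.T                                 # <s(x), s(y)>
    cross = ((Sx[:, None, :] - Sy[None, :, :]) * diff).sum(-1)
    return K * (inner + 2 * gamma * cross + 2 * gamma * d - 4 * gamma**2 * sq)

def ksd_sq_v(H):
    """V-statistic: mean over the full n x n Stein kernel matrix."""
    return H.mean()

def ksd_sq_u(H):
    """U-statistic: off-diagonal mean (unbiased)."""
    n = H.shape[0]
    return (H.sum() - np.trace(H)) / (n * (n - 1))

# Example: standard normal target, whose score is s(x) = -x.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))
H = gaussian_stein_kernel(X, X, score=lambda x: -x, gamma=0.5)
print(ksd_sq_v(H), ksd_sq_u(H))   # both near 0 since Q = P here
```

Both statistics touch every entry of the $n \times n$ matrix `H`, which is exactly the quadratic cost the Nyström construction below avoids.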

2. Nyström-Based KSD Estimation

2.1 Nyström Landmarks and Subspace Construction

The Nyström-KSD approach projects the empirical mean embedding

$$\mu_{h_p}(\widehat{Q}_n) = \frac{1}{n} \sum_{i=1}^n h_p(\cdot, x_i)$$

onto a subspace spanned by $m \ll n$ randomly selected "Nyström points." These are $z_j = x_{i_j}$, where the indices $i_j$ are chosen with replacement from $\{1, \ldots, n\}$.

Define

$$\mathcal{H}_{h_p, m} = \mathrm{span}\{ h_p(\cdot, z_j) : 1 \leq j \leq m \} \subset \mathcal{H}_{h_p}.$$

Equivalently, in $\mathcal{H}_k^d$,

$$\mathcal{H}_{p,m} = \mathrm{span}\{\xi_p(z_j) : j = 1, \ldots, m\} \subset \mathcal{H}_k^d.$$

Projection of the empirical mean embedding onto this subspace yields the estimator.

2.2 Projection Formula and Computation

The projected mean is

$$\hat{\mu}^{\mathrm{Nys}} = P_{\mathcal{H}_{h_p,m}}\,\hat{\mu},$$

where $\hat{\mu} = \frac{1}{n} \sum_{i=1}^n h_p(\cdot, x_i)$. Its squared norm is

$$\tilde{S}_p^2(\widehat{Q}_n) = \|\hat{\mu}^{\mathrm{Nys}}\|_{\mathcal{H}_{h_p}}^2 = \beta^\top K_{mm}^{-} \beta$$

with

$$\begin{align*} K_{mm} &= [h_p(z_j, z_{j'})]_{j,j'=1}^m \in \mathbb{R}^{m \times m}, \\ K_{mn} &= [h_p(z_j, x_i)]_{j=1,\ldots,m,\; i=1,\ldots,n} \in \mathbb{R}^{m \times n}, \\ \beta &= \frac{1}{n} K_{mn} 1_n \in \mathbb{R}^m, \end{align*}$$

and $K_{mm}^{-}$ is the Moore–Penrose pseudoinverse of $K_{mm}$.

Pseudocode

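A minimal NumPy sketch of the procedure (ours, assuming a user-supplied `stein_kernel(A, B)` callable that returns the matrix $[h_p(a_i, b_j)]$, e.g. built from `gaussian_stein_kernel` above):

```python
import numpy as np

def nystrom_ksd_sq(X, stein_kernel, m, seed=None):
    """Nyström-KSD squared statistic from m landmarks."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # 1. Draw m landmark indices uniformly with replacement.
    idx = rng.integers(0, n, size=m)
    Z = X[idx]
    # 2. Small Gram blocks: O(m^2) and O(mn) kernel evaluations.
    K_mm = stein_kernel(Z, Z)            # (m, m)
    K_mn = stein_kernel(Z, X)            # (m, n)
    # 3. beta = (1/n) * K_mn @ 1_n.
    beta = K_mn.mean(axis=1)
    # 4. Squared norm of the projection: beta^T K_mm^- beta.
    return beta @ np.linalg.pinv(K_mm) @ beta
```

For instance, `nystrom_ksd_sq(X, lambda A, B: gaussian_stein_kernel(A, B, lambda x: -x, 0.5), m=int(len(X) ** 0.5))` reproduces the $m \approx \sqrt{n}$ configuration discussed below.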

The time is dominated by the $\mathcal{O}(mn)$ kernel evaluations for $K_{mn}$ and the $\mathcal{O}(m^3)$ cost of the pseudoinverse.

3. Computational and Memory Complexity

| Estimator | Time Complexity | Memory Requirement |
| --- | --- | --- |
| Quadratic-time KSD (V/U) | $\mathcal{O}(n^2)$ | $\mathcal{O}(n^2)$ |
| Nyström-KSD | $\mathcal{O}(mn + m^3)$ | $\mathcal{O}(mn + m^2)$ |

For $m = o(n^{2/3})$, Nyström-KSD achieves strictly sub-quadratic runtime and memory. When $m \approx \sqrt{n}$, the cost becomes near-linear in $n$, representing a significant acceleration in large-scale applications (Kalinke et al., 2024).
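As an illustrative back-of-the-envelope calculation (ours, not from the paper): with $n = 10^6$ and $m = \sqrt{n} = 10^3$, the Nyström cost is $mn + m^3 = 10^9 + 10^9 = 2 \times 10^9$ kernel-level operations versus $n^2 = 10^{12}$ for the full statistic, roughly a 500-fold reduction; memory drops from $10^{12}$ to about $10^9$ stored entries.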

4. Statistical Accuracy and $\sqrt{n}$-Consistency

Under Assumption A (the Stein features $\xi_p(X)$ are sub-Gaussian in $\mathcal{H}_k^d$), the Nyström-KSD estimator achieves a $\sqrt{n}$-rate:

Theorem 4.1 ($\sqrt{n}$-consistency):

Let $C_{Q,h_p} = \mathbb{E}[h_p(\cdot, X) \otimes h_p(\cdot, X)]$ with $\mathrm{Tr}(C) = \mathrm{trace}\,C_{Q,h_p}$. For regularization parameter $\lambda = c\,\mathrm{Tr}(C)/m$, $m \geq 4$, and $m \gtrsim \max\{\mathrm{Tr}(C)/\lambda, \log(1/\delta)\}$, with probability at least $1 - \delta$,

$$|S_p(Q) - \tilde{S}_p(\widehat{Q}_n)| \lesssim \frac{\mathrm{Tr}(C)\log(1/\delta)}{n} + \sqrt{\frac{\mathrm{Tr}(C)\log(1/\delta)}{n}} + \frac{\sqrt{\mathrm{Tr}(C)\log(1/\delta)}}{m}\sqrt{N(\lambda)\log(n/\delta)}$$

where $N(\lambda) = \mathrm{trace}\big(C(C + \lambda I)^{-1}\big)$ is the effective dimension.

If $N(\lambda) \lesssim \lambda^{-\gamma}$ (polynomial decay, $\gamma \in (0,1]$) or $N(\lambda) \lesssim \log(1 + O(1)/\lambda)$ (exponential decay), then setting $m \sim n^{1/(2-\gamma)}$ yields $O_p(1/\sqrt{n})$ estimation error, matching the quadratic estimator's rate at sub-quadratic cost.
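For a concrete reading (our arithmetic): $\gamma = 1/2$ gives $m \sim n^{1/(2 - 1/2)} = n^{2/3}$, so for $n = 10^6$ about $m = 10^4$ landmarks suffice, while faster spectral decay ($\gamma \to 0$, approaching the exponential-decay regime) pushes the requirement down toward $m \sim \sqrt{n}$.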

Proof Sketch:

Three contributions to the deviation are bounded: the empirical mean error ($O_p(1/\sqrt{n})$ by a Hilbert-space Bernstein inequality), the projection bias ($O(\sqrt{\lambda})$ by an operator-valued Bernstein inequality), and the subsampling error (analyzed via conditional concentration and sub-Gaussian tail bounds). Setting $\lambda \sim \mathrm{Tr}(C)/m$ balances the bias–variance tradeoff (Kalinke et al., 2024).
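To get a feel for the effective dimension $N(\lambda)$ that drives the landmark requirement, one can plug in the spectrum of the empirical Stein kernel matrix. A rough illustrative sketch (our plug-in heuristic, not a procedure from the paper, which states the theorem for the population operator):

```python
import numpy as np

def effective_dimension(H, lam):
    """Plug-in estimate of N(lambda) = trace(C (C + lambda I)^{-1}).

    Eigenvalues of the covariance operator C are approximated by the
    spectrum of H / n, where H is the n x n Stein kernel Gram matrix.
    """
    mu = np.linalg.eigvalsh(H) / H.shape[0]
    mu = np.clip(mu, 0.0, None)     # guard tiny negative eigenvalues
    return float(np.sum(mu / (mu + lam)))
```

Fast spectral decay (most eigenvalues far below $\lambda$) yields a small $N(\lambda)$ and hence fewer required landmarks.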

5. Implementation Recommendations

  • Choice of $m$ (number of Nyström landmarks): For spectra with $N(\lambda) \sim \lambda^{-\gamma}$, $m \gtrsim n^{1/(2-\gamma)}$ ensures $O(1/\sqrt{n})$ error. In empirical work, $m \approx \sqrt{n}$ or $m = 4\sqrt{n}$ yielded favorable results.
  • Kernel selection and tuning: For smooth distributions, the Gaussian kernel $k(x, y) = \exp(-\gamma\|x-y\|^2)$ with the "median heuristic" for $\gamma$ is effective. For heavy-tailed alternatives, the IMQ kernel $k(x, y) = (c^2 + \|x-y\|^2)^\beta$ with $\beta < 0$ generally increases test power. The parameters $c, \beta$ can be set using median-distance heuristics or other established routines; a sketch of both choices follows this list.
  • Algorithmic behavior: Landmarks are drawn uniformly (with replacement) from the data indices; increasing $m$ improves the approximation but incurs higher cost.
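A short sketch of these tuning choices (the helper names and the $\gamma = 1/(2\,\mathrm{median}^2)$ scaling are our assumptions; several conventions for the median heuristic exist in the literature):

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic_gamma(X):
    """Bandwidth for the Gaussian kernel exp(-gamma * ||x - y||^2),
    using the convention gamma = 1 / (2 * median^2) over pairwise
    Euclidean distances."""
    med = np.median(pdist(X))
    return 1.0 / (2.0 * med ** 2)

def imq_kernel(X, Y, c=1.0, beta=-0.5):
    """IMQ base kernel (c^2 + ||x - y||^2)^beta with beta < 0."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return (c ** 2 + sq) ** beta
```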

6. Empirical Benchmarking and Comparative Performance

Kalinke et al. (2024) compared Nyström-KSD ("N-KSD") against:

  • Quadratic-time KSD with Gaussian and IMQ kernels,
  • Finite-set Stein discrepancy (FSSD-rand, FSSD-opt; linear time),
  • Random-feature Stein discrepancies (L¹-IMQ, L²-SechExp; near-linear time).

Test settings included Laplace vs. Normal ($d = 1, \ldots, 10$, $n$ up to 1000), Student-$t(5)$ vs. Normal ($d$ up to 10), and non-normalized RBM goodness-of-fit ($n = 1000$). N-KSD (with $m \approx \sqrt{n}$) was orders of magnitude faster than full KSD and, for small $n$, the fastest among all approaches. In statistical power, N-KSD closely matched full KSD (especially with the IMQ kernel) and robustly outperformed other fast Stein discrepancy approximations in all tested dimensions (Kalinke et al., 2024).

The Nyström approximation traces to Williams and Seeger (2001) for kernel acceleration, while classical KSD tests stem from Chwialkowski et al. (2016) and Liu et al. (2016). This estimator leverages tools from operator concentration (Koltchinskii and Lounici, 2017) for statistical analysis. Its main innovation is enabling $\sqrt{n}$-consistent KSD-based inference with subquadratic complexity, bridging kernel Stein methods and scalable approximation strategies for reproducing kernel Hilbert spaces.


References:

  • Kalinke et al. (2024). "Nyström Kernel Stein Discrepancy."
  • Chwialkowski, K., Strathmann, H., & Gretton, A. (2016). "A Kernel Test of Goodness of Fit." ICML.
  • Liu, Q., Lee, J., & Jordan, M. (2016). "A Kernelized Stein Discrepancy for Goodness-of-fit Tests." ICML.
  • Williams, C. K. I., & Seeger, M. (2001). "Using the Nyström Method to Speed Up Kernel Machines." NIPS.
  • Koltchinskii, V., & Lounici, K. (2017). "Concentration Inequalities and Moment Bounds for Sample Covariance Operators." Bernoulli.
