Kernel Herding in RKHS Sampling
- Kernel herding is a deterministic quadrature method that leverages RKHS geometry to approximate target distributions and minimize integration errors.
- It employs a greedy conditional gradient (Frank–Wolfe) approach to select sample points, ensuring efficient convergence and optimized weight reallocation.
- Kernel herding underpins applications in adaptive quadrature, particle filtering, and distribution compression, providing theoretical guarantees and improved space-filling properties.
Kernel herding is a deterministic sampling and quadrature technique grounded in the geometry of reproducing kernel Hilbert spaces (RKHS). Its goal is to construct finite sets of weighted points whose empirical (or quadrature) measure closely approximates a target probability distribution in the RKHS norm, thereby minimizing worst-case integration error for functions in that space. The method iteratively selects sample locations that greedily decrease the maximum mean discrepancy between the empirical and target distributions. Kernel herding admits an equivalent formulation as a conditional gradient (Frank–Wolfe) procedure for minimizing a quadratic moment discrepancy and is intimately connected to Bayesian quadrature, providing theoretical guarantees and computational strategies for fast-converging quadrature rules in both finite- and infinite-dimensional settings (Bach et al., 2012, Huszár et al., 2012, Huszár et al., 2014, Chen et al., 2012, Lacoste-Julien et al., 2015, Rouault et al., 2024).
1. Theoretical Foundation and Formulation
Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a continuous, positive-definite kernel with associated RKHS $\mathcal{H}$. For a probability distribution $p$ over $\mathcal{X}$, its kernel mean embedding is $\mu_p = \int_{\mathcal{X}} k(\cdot, x)\,\mathrm{d}p(x) \in \mathcal{H}$. Given a finite set of nodes $x_1, \dots, x_n \in \mathcal{X}$ and weights $w_1, \dots, w_n$ with $\sum_{i=1}^n w_i = 1$, the empirical embedding is $\hat{\mu}_n = \sum_{i=1}^n w_i\, k(\cdot, x_i)$. The core objective is to control the worst-case integration error

$$\sup_{\|f\|_{\mathcal{H}} \le 1} \left| \int_{\mathcal{X}} f\,\mathrm{d}p - \sum_{i=1}^n w_i f(x_i) \right| = \left\| \mu_p - \hat{\mu}_n \right\|_{\mathcal{H}}.$$
This quantity is exactly the maximum mean discrepancy (MMD) between $p$ and the empirical measure $\sum_{i=1}^n w_i \delta_{x_i}$, and it forms the theoretical backbone of kernel herding and related quadrature approaches (Bach et al., 2012, Huszár et al., 2012, Huszár et al., 2014, Rouault et al., 2024).
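As a concrete illustration (not taken from the cited works), the squared worst-case error expands to $\mathbb{E}[k(X,X')] - 2\sum_i w_i\,\mu_p(x_i) + \sum_{i,j} w_i w_j\, k(x_i, x_j)$. For a Gaussian kernel and a standard normal target, both the mean embedding and the double expectation are available in closed form, so the MMD can be computed exactly:

```python
import numpy as np

def mmd_sq_gaussian(nodes, weights, sigma=1.0):
    """Squared MMD ||mu_p - mu_hat||^2 for the Gaussian kernel
    k(x, y) = exp(-(x - y)^2 / (2 sigma^2)) and target p = N(0, 1)."""
    x = np.asarray(nodes, dtype=float)
    w = np.asarray(weights, dtype=float)
    # mean embedding mu_p(x) = E_{X~N(0,1)} k(x, X): a Gaussian convolution
    mu_p = sigma / np.sqrt(sigma**2 + 1) * np.exp(-x**2 / (2 * (sigma**2 + 1)))
    # double expectation E_{X,X'~N(0,1)} k(X, X')
    c = sigma / np.sqrt(sigma**2 + 2)
    K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * sigma**2))
    return c - 2 * w @ mu_p + w @ K @ w

# one point at the mode vs. one point in the tail
print(mmd_sq_gaussian([0.0], [1.0]))  # small
print(mmd_sq_gaussian([3.0], [1.0]))  # larger
```

A node at the mode of the target yields a much smaller discrepancy than a node in the tail, as the MMD objective rewards placing mass where the target has mass.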
2. Kernel Herding Algorithm: Greedy Conditional Gradient Approach
Kernel herding operates as a greedy procedure, iteratively selecting the next node $x_{n+1}$ to maximally reduce the worst-case RKHS error. The standard kernel herding update is

$$x_{n+1} \in \arg\max_{x \in \mathcal{X}} \left[ \mu_p(x) - \frac{1}{n} \sum_{i=1}^{n} k(x, x_i) \right],$$

where $\mu_p(x) = \mathbb{E}_{x' \sim p}[k(x, x')]$ and $\hat{\mu}_n = \frac{1}{n}\sum_{i=1}^n k(\cdot, x_i)$ is the current uniformly weighted empirical embedding. The update is a Frank–Wolfe (conditional gradient) step with step size $1/(n+1)$ for the objective $J(\mu) = \frac{1}{2}\|\mu - \mu_p\|_{\mathcal{H}}^2$ on the marginal polytope (the convex hull of the feature maps $\{k(\cdot, x) : x \in \mathcal{X}\}$), equivalently viewed as minimization of a quadratic moment discrepancy (Bach et al., 2012, Lacoste-Julien et al., 2015, Rouault et al., 2024).
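A minimal sketch of this greedy update, assuming a Gaussian kernel, a standard normal target (so the mean embedding $\mu_p$ has a closed form), and maximization over a fixed candidate grid rather than a continuous optimizer:

```python
import numpy as np

def herd(n_points, sigma=1.0, grid=np.linspace(-4, 4, 2001)):
    """Greedy kernel herding for target p = N(0, 1) with Gaussian kernel
    k(x, y) = exp(-(x - y)^2 / (2 sigma^2)), maximizing over a candidate grid."""
    # closed-form mean embedding mu_p(x) for the Gaussian target
    mu_p = sigma / np.sqrt(sigma**2 + 1) * np.exp(-grid**2 / (2 * (sigma**2 + 1)))
    kernel_sum = np.zeros_like(grid)  # running sum_i k(grid, x_i)
    nodes = []
    for n in range(n_points):
        # herding score: mu_p(x) - (1/n) sum_i k(x, x_i); second term is 0 at n = 0
        score = mu_p - (kernel_sum / n if n > 0 else 0.0)
        x_new = grid[np.argmax(score)]
        nodes.append(x_new)
        kernel_sum += np.exp(-(grid - x_new)**2 / (2 * sigma**2))
    return np.array(nodes)

print(herd(5))  # first point lands at the mode; later points spread outward
```

The repulsion term $\frac{1}{n}\sum_i k(x, x_i)$ penalizes candidates close to already-selected nodes, which is what produces the characteristic space-filling behavior.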
Alternate variants include:
- Line-search conditional gradient: adaptive step size to achieve potentially faster convergence under interior assumptions.
- Minimum-norm point (active-set) algorithm: joint re-optimization of weights over accumulated support for further acceleration of convergence (Bach et al., 2012).
3. Convergence Guarantees and Rates
Classical kernel herding with fixed step size achieves $O(1/n^2)$ decay in the squared RKHS error, provided the mean embedding $\mu_p$ lies in the relative interior of the marginal polytope. When this geometric assumption fails (e.g., in an infinite-dimensional RKHS, where the interior is empty), only $O(1/n)$ decay in squared error (equivalently, $O(1/\sqrt{n})$ in the RKHS norm) can be proved (Bach et al., 2012, Huszár et al., 2012, Rouault et al., 2024). For Monte Carlo (i.i.d. sampling), the error decays only as $O(1/\sqrt{n})$.
Weighted versions, particularly those corresponding to Bayesian quadrature (where weights are re-optimized at each step to minimize posterior variance), can attain strictly faster rates, even exponential under strong smoothness/eigenvalue conditions (Huszár et al., 2014, Huszár et al., 2012). Concentration inequalities for probabilistic herding based on Gibbs measures show exponentially tighter confidence bounds relative to classical Monte Carlo, with sub-Gaussian tails and constants improved by an effective "inverse temperature" parameter (Rouault et al., 2024).
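The gap between the herding and Monte Carlo regimes can be seen in a toy 1D experiment (illustrative only; the Gaussian kernel, normal target, and grid maximizer are conveniences, not choices from the cited papers):

```python
import numpy as np

def mu_p(x):
    # closed-form mean embedding E_{X~N(0,1)} exp(-(x - X)^2 / 2) for sigma = 1
    return np.exp(-np.asarray(x)**2 / 4) / np.sqrt(2)

def mmd(nodes):
    # RKHS distance between N(0, 1) and the uniform measure on `nodes`
    x = np.asarray(nodes)
    c = 1 / np.sqrt(3)  # E k(X, X') for X, X' ~ N(0, 1)
    K = np.exp(-(x[:, None] - x[None, :])**2 / 2)
    return np.sqrt(max(c - 2 * mu_p(x).mean() + K.mean(), 0.0))

# herding nodes via greedy argmax over a candidate grid
grid = np.linspace(-4, 4, 2001)
base_score = mu_p(grid)
ksum = np.zeros_like(grid)
herd_nodes = []
for n in range(50):
    score = base_score - (ksum / n if n > 0 else 0.0)
    x_new = grid[np.argmax(score)]
    herd_nodes.append(x_new)
    ksum += np.exp(-(grid - x_new)**2 / 2)

# i.i.d. Monte Carlo nodes of the same size
rng = np.random.default_rng(0)
mc_nodes = rng.standard_normal(50)

print(f"herding MMD: {mmd(herd_nodes):.4f}, Monte Carlo MMD: {mmd(mc_nodes):.4f}")
```

At the same budget of 50 nodes, the deterministic herding points achieve a markedly smaller MMD than an i.i.d. sample, consistent with the faster (here empirically near-$O(1/n)$) decay discussed above.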
4. Extensions: Variants, Sparsity, and Sparse Quadrature
Several variants and extensions of kernel herding have been developed to address computational efficiency and solution sparsity:
- Pairwise and Blended Pairwise Conditional Gradient (BPCG) algorithms eliminate "swap steps" that limit progress in the Pairwise Conditional Gradient (PCG) method, enabling direct application in infinite-dimensional RKHS. BPCG yields sparser quadrature rules and matches or surpasses classical herding in convergence and atomic support size (Tsuji et al., 2021).
- Gradient Approximation and Fully-Corrective Schemes: By approximating the search direction using a positive combination of multiple vertex directions (e.g., pursuit strategies), one can accelerate decay in integration error and produce sparser quadrature nodes. Fully-corrective variants further optimize weights over the selected support, enhancing both error and atomic sparsity (Tsuji et al., 2021).
Empirical results show that BPCG and fully-corrective herding variants can outperform classical kernel herding in both worst-case MMD and sparsity, especially for Matérn and Gaussian kernels (Tsuji et al., 2021).
5. Connections to Bayesian Quadrature and Probabilistic Herding
Kernel herding minimizes exactly the same criterion as Bayesian quadrature: the RKHS norm between the mean embedding of the target measure and its empirical approximation, which is also the posterior variance of the integral under a kernel-based Gaussian process prior (Bach et al., 2012, Huszár et al., 2014, Huszár et al., 2012). The unweighted (classical) herding strategy is equivalent to Bayesian quadrature with uniform weights; optimally weighted herding corresponds to sequential Bayesian quadrature, attaining faster than $O(1/n)$ decay (Huszár et al., 2014).
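For a fixed node set, the optimal (Bayesian quadrature) weights solve the linear system $Kw = z$ with $K_{ij} = k(x_i, x_j)$ and $z_i = \mu_p(x_i)$, since the posterior variance $c - 2w^\top z + w^\top K w$ is quadratic in $w$. A sketch for a Gaussian kernel and standard normal target, with a small jitter term added for numerical stability:

```python
import numpy as np

def bq_weights(nodes, sigma=1.0):
    """Bayesian-quadrature weights w = K^{-1} z for the Gaussian kernel and
    target N(0, 1); z_i = mu_p(x_i) is the mean embedding at each node."""
    x = np.asarray(nodes, dtype=float)
    K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * sigma**2))
    z = sigma / np.sqrt(sigma**2 + 1) * np.exp(-x**2 / (2 * (sigma**2 + 1)))
    return np.linalg.solve(K + 1e-10 * np.eye(len(x)), z)

def post_var(nodes, w, sigma=1.0):
    """Posterior variance of the integral = squared MMD of the weighted rule."""
    x = np.asarray(nodes, dtype=float)
    K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * sigma**2))
    z = sigma / np.sqrt(sigma**2 + 1) * np.exp(-x**2 / (2 * (sigma**2 + 1)))
    c = sigma / np.sqrt(sigma**2 + 2)
    return c - 2 * w @ z + w @ K @ w

nodes = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
w_uni = np.full(5, 0.2)          # classical herding: uniform weights
w_bq = bq_weights(nodes)          # sequential BQ: re-optimized weights
print(post_var(nodes, w_uni), post_var(nodes, w_bq))  # BQ is never worse
```

Because the BQ weights minimize the quadratic objective over all weight vectors (here unconstrained, so they need not sum to one), the resulting posterior variance is guaranteed to be no larger than under uniform weights.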
Probabilistic herding generalizes the approach by introducing a Gibbs measure over node configurations with density proportional to $\exp(-\beta E(x_1, \dots, x_n))$, where $E$ is an RKHS energy functional measuring repulsion through the kernel and $\beta$ is an inverse temperature. This probabilistic version achieves exponentially sharp concentration inequalities on worst-case integration error, and empirical evidence suggests practical convergence at the fast $O(1/n)$ rate in moderate dimensions (Rouault et al., 2024).
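A schematic reading of this construction, using the configuration's squared MMD as the energy $E$ and a random-walk Metropolis chain to sample the Gibbs measure (the actual energy functional and sampler in Rouault et al., 2024 may differ):

```python
import numpy as np

def energy(x, sigma=1.0):
    # RKHS energy of a configuration: squared MMD to N(0, 1) (schematic choice)
    x = np.asarray(x)
    K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * sigma**2))
    z = sigma / np.sqrt(sigma**2 + 1) * np.exp(-x**2 / (2 * (sigma**2 + 1)))
    c = sigma / np.sqrt(sigma**2 + 2)
    return c - 2 * z.mean() + K.mean()

def metropolis_herding(n=20, beta=500.0, steps=2000, seed=0):
    """Random-walk Metropolis targeting the Gibbs density ~ exp(-beta * E(x))."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)  # initial configuration
    e = energy(x)
    for _ in range(steps):
        prop = x + 0.1 * rng.standard_normal(n)
        e_prop = energy(prop)
        # Metropolis acceptance rule for the Gibbs target
        if e_prop <= e or rng.random() < np.exp(-beta * (e_prop - e)):
            x, e = prop, e_prop
    return x, e

x, e = metropolis_herding()
print(f"final energy (squared MMD): {e:.4f}")
```

Larger $\beta$ concentrates the Gibbs measure on low-energy (well-spread) configurations, interpolating between i.i.d.-like sampling at $\beta = 0$ and deterministic herding-like behavior as $\beta \to \infty$.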
6. Applications and Space-Filling Properties
Kernel herding and its variants are used in adaptive quadrature, sequential Bayesian quadrature, particle filtering, active learning, validation design, and distribution compression:
- Particle Filtering: Replacing the random sampling step in a particle filter with herding yields immediate gains in efficiency, with provable MMD convergence under mild conditions (Lacoste-Julien et al., 2015).
- Validation Designs: Herding-based validation sets exhibit superior space-filling properties, systematically filling gaps left by existing design samples (Pronzato et al., 2021).
- Distribution Compression: Classical herding and its joint/conditional extensions (e.g., JKH, ACKH, ACKIP) enable linear-time coreset construction with precise theoretical rates and improved conditional approximation (Broadbent et al., 2025).
- High-dimensional and Structured Sampling: Methods such as Continuous Herded Gibbs Sampling replace full joint optimization with coordinate-wise greedy updates, ensuring high-dimensional scalability while maintaining the deterministic, fast-converging nature of kernel herding (Wolf et al., 2021).
Empirically, kernel herding yields better sample coverage (lower covering radius) and negative auto-correlations, comparable or superior to Quasi-Monte Carlo methods for a wide class of kernels and domains (Chen et al., 2012, Bach et al., 2012, Pronzato et al., 2021).
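The coordinate-wise idea behind herded Gibbs-style samplers can be sketched as follows, assuming a product Gaussian kernel and a standard normal target in 2D so that the mean embedding factorizes per coordinate (an illustration of the principle, not the algorithm of Wolf et al., 2021):

```python
import numpy as np

GRID = np.linspace(-4, 4, 401)

def mu_1d(x):
    # 1D mean embedding of N(0, 1) under k(x, y) = exp(-(x - y)^2 / 2)
    return np.exp(-np.asarray(x)**2 / 4) / np.sqrt(2)

def herded_gibbs_2d(n_points):
    """Coordinate-wise greedy herding: optimize one coordinate at a time,
    holding the other fixed (sketch of the continuous herded Gibbs idea)."""
    nodes = []
    x = np.zeros(2)
    for _ in range(n_points):
        for d in range(2):  # cycle through coordinates
            cand = np.tile(x, (len(GRID), 1))
            cand[:, d] = GRID
            if nodes:
                P = np.asarray(nodes)
                # product kernel: k((a,b), (c,e)) = k1(a,c) * k1(b,e)
                Kc = np.exp(-(cand[:, None, 0] - P[None, :, 0])**2 / 2) \
                   * np.exp(-(cand[:, None, 1] - P[None, :, 1])**2 / 2)
                penalty = Kc.mean(axis=1)
            else:
                penalty = 0.0
            # herding score restricted to coordinate d (mean embedding factorizes)
            score = mu_1d(GRID) * mu_1d(x[1 - d]) - penalty
            x = cand[np.argmax(score)].copy()
        nodes.append(x.copy())
    return np.asarray(nodes)

pts = herded_gibbs_2d(30)
print(pts.mean(axis=0))  # roughly centered on the target mean (0, 0)
```

Each sweep only requires 1D maximizations, which is what makes the coordinate-wise scheme scale to high dimensions where a joint argmax over the full space would be intractable.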
7. Outlook and Ongoing Research Directions
Current research at the intersection of convex optimization, statistical physics, and machine learning continues to expand the landscape of kernel herding:
- Improved non-asymptotic confidence bounds (e.g., non-asymptotic central limit theorems for single-integrand functionals) (Rouault et al., 2024).
- Development of new sampling and optimization algorithms (e.g., MALA for Gibbs-based probabilistic herding, efficient block-coordinate variants) (Rouault et al., 2024, Wolf et al., 2021).
- Further exploitation of Stein kernels and domain generalization to unbounded supports without explicit kernel mean evaluation (Rouault et al., 2024).
- Optimized compression of conditional distributions, outperforming joint kernel herding when the conditional is the primary target (Broadbent et al., 2025).
Kernel herding thus provides both a theoretical and computational bridge between deterministic quadrature, adaptive sampling, and modern nonparametric Bayesian inference frameworks (Chen et al., 2012, Rouault et al., 2024, Huszár et al., 2012).