Bayesian Coresets in Scalable Inference

Updated 24 April 2026

Bayesian coresets are small, weighted subsets of data that approximate full posterior distributions to accelerate Bayesian inference.
They reduce computational and memory costs by exploiting data redundancy, making methods such as MCMC and variational inference practical on large datasets.
Recent advances extend coresets to deep learning, robust, and streaming settings with theoretical guarantees on error bounds and performance.

A Bayesian coreset is a small, weighted (and possibly synthetic) subset of a dataset that is used in place of the full data to accelerate or scale Bayesian inference, while preserving high-fidelity approximations to the full posterior distribution. This concept leverages the often-high redundancy in large-scale data, reducing both computational and memory costs for methods such as Markov Chain Monte Carlo (MCMC) and variational inference. The field has diversified into classic (subset) Bayesian coresets, which reweight a small selection of real datapoints, and Bayesian pseudo-coresets, which learn synthetic datapoints and their weights, with recent advances further extending the methodology to deep learning, variational, and robust settings.

1. Formal Definitions: Bayesian Coresets and Pseudo-Coresets

Let $D = \{x_n\}_{n=1}^N$ be a dataset, $p(\theta)$ a prior, and $p(x_n \mid \theta)$ the likelihood. The full posterior is

$p(\theta | D) \propto p(\theta) \prod_{n=1}^N p(x_n \mid \theta).$

A Bayesian coreset is a small weighted subset $S = \{(w_i, x_i)\}_{i=1}^M$ with $w_i \geq 0$, $M \ll N$, such that

$p(\theta | S) \propto p(\theta) \prod_{i=1}^M p(x_i | \theta)^{w_i}$

closely approximates $p(\theta | D)$. The selection of $S$ and the weights $w_i$ is typically posed as minimizing a divergence (e.g., KL) between the coreset and true posteriors (Huggins et al., 2016, Campbell et al., 2017).
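As a minimal sketch of the weighted posterior above, consider a conjugate toy model that is an assumption of this example rather than part of the cited work: $x_n \sim \mathcal{N}(\theta, \sigma^2)$ with prior $\theta \sim \mathcal{N}(0, \tau^2)$. Raising each likelihood term to the power $w_i$ keeps the posterior Gaussian, so the full and coreset posteriors can be compared in closed form. The function name `gaussian_mean_posterior` and the naive uniform subsample with weights $N/M$ are illustrative choices only.

```python
import numpy as np

def gaussian_mean_posterior(x, w, noise_var=1.0, prior_var=10.0):
    """Posterior N(mean, var) of theta for weighted data x_i ~ N(theta, noise_var)
    under the prior theta ~ N(0, prior_var); likelihood term i is raised to w_i."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    precision = 1.0 / prior_var + w.sum() / noise_var
    mean = (w * x).sum() / noise_var / precision
    return mean, 1.0 / precision

rng = np.random.default_rng(0)
N, M = 10_000, 50
data = rng.normal(loc=2.0, scale=1.0, size=N)

# Full posterior: every datapoint carries weight 1.
full_mean, full_var = gaussian_mean_posterior(data, np.ones(N))

# Naive coreset (illustrative): uniform subsample, each point weighted N / M.
idx = rng.choice(N, size=M, replace=False)
core_mean, core_var = gaussian_mean_posterior(data[idx], np.full(M, N / M))

print(f"full:    mean={full_mean:.4f}, var={full_var:.2e}")
print(f"coreset: mean={core_mean:.4f}, var={core_var:.2e}")
```

In this toy model the weights sum to $N$, so the coreset posterior variance matches the full one exactly and only the mean is perturbed. Optimized constructions aim to shrink that remaining gap.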

A pseudo-coreset (Bayesian pseudo-coreset, BPC) generalizes this by learning synthetic inputs $u_i$ (not necessarily part of $D$) and optionally pseudo-labels $\tilde{y}_i$:

$\tilde{S} = \{(w_i, u_i, \tilde{y}_i)\}_{i=1}^M$

such that

$p(\theta | \tilde{S}) \propto p(\theta) \prod_{i=1}^M p(u_i, \tilde{y}_i | \theta)^{w_i}$

best matches the true posterior in a chosen divergence (Kim et al., 2022, Lee et al., 28 Feb 2025).

2. Optimization Frameworks and Divergences

Coreset construction is formalized as a (usually nonconvex) optimization:

$\min_{S = \{(w_i, x_i)\}_{i=1}^M} \; \mathrm{Div}\big(p(\theta | S),\; p(\theta | D)\big)$

where $\mathrm{Div}(\cdot, \cdot)$ is a divergence or distance:

Reverse KL: mode-seeking, favoring selectivity
Forward KL: mass-covering, especially suitable for capturing multimodality (Kim et al., 2022)
Wasserstein distance: matches moments or gradient flows
Contrastive Divergence: difference of two forward KLs involving MCMC transitions (Tiwary et al., 2023)
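The mode-seeking versus mass-covering contrast in this list can be checked on univariate Gaussians, where KL has a closed form. The sketch below assumes the convention that forward KL places the full posterior first, $\mathrm{KL}(p(\theta | D) \,\|\, p(\theta | S))$, and reverse KL places the coreset posterior first. The target and the two candidate approximations are made-up stand-ins, not posteriors from any cited experiment.

```python
import math

def kl_gauss(m_p, v_p, m_q, v_q):
    """KL( N(m_p, v_p) || N(m_q, v_q) ) for univariate Gaussians (v = variance)."""
    return 0.5 * (math.log(v_q / v_p) + (v_p + (m_p - m_q) ** 2) / v_q - 1.0)

target = (0.0, 1.0)    # stand-in for the full posterior
candidates = {"narrow": (0.0, 0.25), "wide": (0.0, 4.0)}

# Forward KL: the target comes first, so missing target mass is penalised.
forward = {name: kl_gauss(*target, *q) for name, q in candidates.items()}
# Reverse KL: the approximation comes first, so excess spread is penalised.
reverse = {name: kl_gauss(*q, *target) for name, q in candidates.items()}

for name in candidates:
    print(f"{name:>6}: forward={forward[name]:.3f}, reverse={reverse[name]:.3f}")
```

Forward KL ranks the wide approximation as the better one, while reverse KL prefers the narrow one, which is the asymmetry the list describes.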

For pseudo-coresets, all parameters (locations, labels, weights) are optimized using gradient-based procedures, often with stochastic approximations or MCMC in the inner loop (Manousakas et al., 2022, Tiwary et al., 2023).

Recent work has introduced bilevel optimization and variational approximations to scale pseudo-coreset training to deep BNNs (Lee et al., 28 Feb 2025, Kim et al., 2023).

3. Theoretical Error Bounds, Guarantees, and Limitations

General upper and lower bounds on the quality of Bayesian coreset approximations are established in terms of Kullback–Leibler divergence (Campbell, 2024):

Lower bounds: A coreset that does not capture all directions in parameter space cannot achieve small KL error, so the coreset size must grow at least linearly in the parameter dimension $d$. Even with optimal rescaling, naive importance-sampling coresets require $M \propto N$ to control the KL error.
Upper bounds: Under a subexponentiality criterion on the log-likelihood (weaker than global log-concavity), optimized coresets of polylogarithmic size, $M = O(\operatorname{polylog} N)$, can ensure $O(1)$ KL error.
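The weakness of unoptimized sampling can be illustrated numerically. The sketch below assumes a toy Gaussian mean model with unit noise variance and a $\mathcal{N}(0, 10)$ prior (neither taken from the cited paper). It measures the KL divergence between the posterior of a uniform subsample with weights $N/M$ and the full posterior. Both posteriors share the same variance in this model, so the KL reduces to a scaled squared difference of means.

```python
import numpy as np

def kl_uniform_coreset(data, M, rng, noise_var=1.0, prior_var=10.0):
    """KL between Gaussian-mean posteriors of a uniform coreset (weights N/M)
    and of the full dataset; the two posteriors have identical variance."""
    N = data.size
    precision = 1.0 / prior_var + N / noise_var
    full_mean = data.sum() / noise_var / precision
    sub = rng.choice(data, size=M, replace=False)
    core_mean = (N / M) * sub.sum() / noise_var / precision
    return 0.5 * precision * (core_mean - full_mean) ** 2

rng = np.random.default_rng(1)
data = rng.normal(size=20_000)

# Average over repeated subsamples for three coreset sizes.
avg_kl = {M: np.mean([kl_uniform_coreset(data, M, rng) for _ in range(200)])
          for M in (20, 200, 2000)}
for M, value in avg_kl.items():
    print(f"M={M:5d}  average KL ~ {value:.2f}")
```

In this toy setting the average KL shrinks only roughly in proportion to $N/M$, so keeping it bounded as $N$ grows would require $M$ to grow with $N$.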

For pseudo-coresets, tighter error bounds are more model-specific, but both variational objectives and correlation-based greedy selection converge with sublinear or geometric error decay under mild conditions (Campbell et al., 2019). Exponential compression ($M = O(\log N)$) is achievable in settings such as Gaussian mean inference (Chen et al., 2022).

4. Algorithms and Methodological Variants

Subset Coresets

Method	Summary	Key Properties
Hilbert/Fisher coresets	Greedy, Frank-Wolfe in Hilbert-norm on log-likelihoods	Provable error decay in coreset size $M$, fully automated (Campbell et al., 2017)
Quasi-Newton (QNC)	Subsample + quasi-Newton refinement using stochastic gradients	Black-box, exponential convergence (Naik et al., 2022)
Riemannian/SparseVI	Greedy selection maximizing Fisher correlation (natural gradient)	Information-geometric, consistent (Campbell et al., 2019)
Sparse Hamiltonian Flows	Combines HMC over sparse posteriors with quasi-refreshment	Exponential compression (Chen et al., 2022)
Accelerated Hard-Thresholding	Direct $\ell_0$-constrained nonconvex regression	Linear convergence under RIP (Zhang et al., 2020)
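Several of the subset methods above share one template: represent each datapoint by its log-likelihood evaluated at a finite set of parameter samples, then choose nonnegative weights so that the weighted coreset sum matches the full-data sum. The sketch below is a deliberately simplified stand-in for that template (a random subsample followed by projected gradient descent). It is not the Frank-Wolfe, quasi-Newton, or hard-thresholding algorithm from the table. The Gaussian toy model, the parameter samples `thetas`, and all sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, J = 5_000, 30, 100
data = rng.normal(loc=1.0, scale=1.0, size=N)

# Assumption: parameter samples drawn from a cheap, rough posterior approximation.
thetas = rng.normal(loc=data[:100].mean(), scale=0.5, size=J)

def loglik_matrix(x, thetas):
    """Rows: datapoints; columns: Gaussian log-likelihood (up to a constant) at each theta."""
    return -0.5 * (x[:, None] - thetas[None, :]) ** 2

target = loglik_matrix(data, thetas).sum(axis=0)   # full-data log-likelihood vector
idx = rng.choice(N, size=M, replace=False)
A = loglik_matrix(data[idx], thetas).T             # shape (J, M)

def residual_norm(w):
    return np.linalg.norm(A @ w - target)

w = np.full(M, N / M)                              # start from uniform-subsample weights
err_uniform = residual_norm(w)
step = 1.0 / np.linalg.norm(A, ord=2) ** 2         # 1 / Lipschitz constant of the gradient
for _ in range(5_000):
    grad = A.T @ (A @ w - target)                  # gradient of 0.5 * ||A w - target||^2
    w = np.maximum(w - step * grad, 0.0)           # project onto the constraint w >= 0
err_opt = residual_norm(w)

print(f"residual with uniform weights:   {err_uniform:.3f}")
print(f"residual with optimized weights: {err_opt:.3f}")
```

With this step size the projected gradient iteration never increases the residual, so the optimized weights are at least as good as the uniform starting point under this finite-sample norm.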

Pseudo-Coresets

Method	Divergence/Objective	Notable Features
Reverse KL pseudo-coreset (BPC-rKL)	Mode-seeking KL	Corresponds to dataset condensation (DC) in dataset distillation (Kim et al., 2022)
Forward KL pseudo-coreset (BPC-fKL)	Mass-covering KL	Avoids mode collapse, efficient (Kim et al., 2022)
Wasserstein pseudo-coreset	2-Wasserstein	Corresponds to matching training trajectories (MTT) in dataset distillation
Contrastive Divergence BPC	Difference of forward KLs	No need for stationary samples from the posterior; finite-step MCMC (Tiwary et al., 2023)
Function-Space BPC (FBPC)	Function-space KL	Matches function posteriors; robust to weight multi-modality (Kim et al., 2023)
Variational Bayesian Pseudo-Coreset (VBPC)	Bilevel last-layer VI	Closed-form VI for last layer; scales to BNNs (Lee et al., 28 Feb 2025)
Black-box Coreset VI	Stochastic VI with VI proposals	Handles general intractable posteriors (Manousakas et al., 2022)

Synthetic pseudo-coresets are optimized via backpropagation, often employing last-layer VI (mean-field/block-Gaussian), bilevel objectives, or stochastic gradients through MCMC inner loops. Closed-form solutions are available for last-layer Gaussian regression (Lee et al., 28 Feb 2025). Much of the algorithmic effort goes into reducing memory use, cutting the number of expensive operations, and supporting streaming or parallel execution.
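To make the gradient-based view concrete, here is a toy sketch in which synthetic points are learned by descending a KL objective. It assumes a Gaussian mean model with unit noise variance, a $\mathcal{N}(0, 10 I)$ prior, and fixed weights $N/M$. Under those assumptions both posteriors share a covariance, the KL is a scaled squared mean difference, and the gradient is available by hand. Deep-learning pseudo-coresets use automatic differentiation and approximate inner loops instead, so this only illustrates the shape of the outer loop.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d = 2_000, 5, 3
data = rng.normal(loc=[1.0, -2.0, 0.5], scale=1.0, size=(N, d))
prior_var = 10.0

precision = 1.0 / prior_var + N        # unit noise variance; pseudo-point weights sum to N
full_mean = data.sum(axis=0) / precision

def coreset_mean(u):
    """Posterior mean implied by M pseudo-points, each weighted N / M."""
    return (N / M) * u.sum(axis=0) / precision

def kl(u):
    # Equal covariances, so forward and reverse KL coincide in this toy model.
    diff = coreset_mean(u) - full_mean
    return 0.5 * precision * diff @ diff

u = rng.normal(size=(M, d))            # synthetic points, randomly initialised
kl_start = kl(u)
lr = 0.5 * M * precision / N**2        # half the inverse of the largest Hessian eigenvalue
for _ in range(100):
    grad = (N / M) * (coreset_mean(u) - full_mean)   # dKL/du_i, the same for every i
    u = u - lr * grad                  # broadcasts over the M pseudo-points
kl_end = kl(u)

print(f"KL before: {kl_start:.3f}, after: {kl_end:.3e}")
```

The learned points need not coincide with any real datapoint. In this model they only have to reproduce the data mean, which is why five synthetic points suffice.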

5. Empirical Benchmarks and Practical Trade-offs

Empirical evaluations on canonical datasets (MNIST, CIFAR-10/100, Fashion-MNIST, Tiny-ImageNet, SVHN, ImageWoof, ImageNet1k) consistently show that:

Well-optimized coresets (greedy, quasi-Newton, variational, or pseudo-coreset) match full-data or mean-field variational inference (MFVI) performance on test accuracy, negative log-likelihood, and out-of-distribution robustness with coresets far smaller than the full dataset, even for large datasets
Pseudo-coresets constructed via forward KL, contrastive divergence, and function-space objectives outperform mode-seeking or unoptimized pseudo-coresets in high-dimensional or deep settings (Tiwary et al., 2023, Kim et al., 2023, Lee et al., 28 Feb 2025)
Function-space matching (FBPC) achieves best robustness and uncertainty quantification in deep BNNs (Kim et al., 2023)
Uniform and importance-sampling coreset methods fail to control posterior approximation error unless $M \propto N$ (Campbell, 2024)

Ablation studies on image benchmarks show diminishing returns once the pseudo-coreset exceeds a modest number of images per class (Lee et al., 28 Feb 2025). Sensitivity to initialization, optimizer, and VI temperature is modest, indicating robustness of recent methods.

6. Extensions: Robustness, Streaming, and Tuning-Free Optimization

Robust Bayesian coreset constructions based on density-power (β-) divergence explicitly downweight outliers, maintain low KL error under heavy contamination, and have analogous stochastic and information-geometric optimization frameworks (Manousakas et al., 2020).

For streaming and parallel settings, coresets can be merged or recalculated in blocks, preserving error bounds and enabling distributed or online Bayesian inference (Huggins et al., 2016).
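The merge step can be sketched as follows, assuming each block is summarized independently and the weighted summaries are concatenated. The per-block construction here is a plain uniform subsample, used only as a placeholder for whichever coreset algorithm is preferred. `block_coreset` and `merge` are hypothetical helper names.

```python
import numpy as np

def block_coreset(block, m, rng):
    """Placeholder per-block coreset: m uniformly chosen points, each weighted len(block) / m."""
    idx = rng.choice(len(block), size=m, replace=False)
    return block[idx], np.full(m, len(block) / m)

def merge(coresets):
    """Merge weighted coresets by concatenating their points and weights."""
    points = np.concatenate([p for p, _ in coresets])
    weights = np.concatenate([w for _, w in coresets])
    return points, weights

rng = np.random.default_rng(0)
stream = rng.normal(loc=3.0, size=12_000)
blocks = np.array_split(stream, 6)     # stand-in for data arriving in blocks or shards

points, weights = merge([block_coreset(b, 40, rng) for b in blocks])
print(len(points), weights.sum())      # 240 points representing all 12,000 observations
```

Because each block is processed in isolation, the per-block step can run on separate workers or as data arrives. A merged coreset that grows too large can be compressed again with the same construction.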

Recent advances have automated critical hyperparameters:

Coreset MCMC with Hot DoG: Learning-rate-free stochastic gradient descent for Bayesian coreset weight tuning, using a hot-start diagnostic to ensure chain mixing and robust RMSProp-style updates, with theoretical convergence guarantees (Chen et al., 2024)
Black-box VI: Handles arbitrary likelihood and prior models, scaling coresets to large, intractable posteriors such as BNNs (Manousakas et al., 2022)
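For intuition about learning-rate-free updates, the sketch below implements only the plain distance-over-gradients (DoG) step-size rule: the step is the largest distance travelled from the starting point divided by the root of the accumulated squared gradient norms. It is not the Hot DoG procedure itself, since the hot-start diagnostic and the RMSProp-style scaling are omitted. The quadratic objective stands in for a coreset weight objective, and `dog_minimize` is a hypothetical name.

```python
import numpy as np

def dog_minimize(grad_fn, x0, steps=2_000, eps=1e-6):
    """Gradient descent with the DoG step size:
    (max distance from x0 so far) / sqrt(sum of squared gradient norms)."""
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    max_dist = eps * (1.0 + np.linalg.norm(x0))   # small initial movement scale
    grad_sq_sum = 0.0
    for _ in range(steps):
        g = grad_fn(x)
        grad_sq_sum += g @ g
        if grad_sq_sum == 0.0:                    # already at a stationary point
            break
        x = x - (max_dist / np.sqrt(grad_sq_sum)) * g
        max_dist = max(max_dist, np.linalg.norm(x - x0))
    return x

# Toy objective 0.5 * ||x - target||^2, whose gradient is x - target.
target = np.array([2.0, -1.0, 0.5])
x_hat = dog_minimize(lambda x: x - target, np.zeros(3))
print(x_hat)
```

No learning rate is supplied. The step size starts tiny, grows as the iterate moves away from its initial value, and then settles as gradients accumulate.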

7. Open Challenges and Outlook

Precise characterization of trade-offs between coreset size and statistical fidelity for multimodal, heavy-tailed, or non-identifiable models remains incompletely resolved. Model-agnostic polylogarithmic-size bounds have recently been obtained under subexponential conditions (Campbell, 2024).
Extensions to hierarchically structured, group, or sequential coresets, as well as application to privacy-preserving and federated learning, are active areas of investigation.
Automatic differentiation, memory efficiency for storing/optimizing synthetic pseudo-coreset points, and the development of scalable variational objectives for possibly non-convex outer problems remain important technical frontiers.

Bayesian coresets and pseudo-coresets have matured into a general, theoretically justified, and highly practical methodology for scalable Bayesian inference across a wide range of modern statistical and machine learning models, including deep neural architectures and robust, real-world applications (Lee et al., 28 Feb 2025, Campbell, 2024, Tiwary et al., 2023, Kim et al., 2023).