
ByPE-VAE: Bayesian Pseudocoresets in VAEs

Updated 25 February 2026
  • The paper introduces ByPE-VAE, which replaces a full-data mixture prior with a learned pseudocoreset prior to reduce computational cost while maintaining high performance.
  • ByPE-VAE employs an alternating training procedure that dynamically updates both VAE parameters and pseudocoreset elements using efficient gradient-based KL minimization.
  • Empirical results on datasets like MNIST, CIFAR-10, and CelebA demonstrate that ByPE-VAE improves density estimation, latent clustering, and training efficiency compared to traditional VAE variants.

ByPE-VAE (Bayesian Pseudocoresets Exemplar Variational Autoencoder) is a deep generative modeling framework that extends the standard variational autoencoder (VAE) architecture by introducing a data-dependent mixture prior based on a small, learned pseudocoreset, rather than the full dataset. By dynamically optimizing both the pseudocoreset elements and their mixture weights to closely match the ideal full-data-based mixture prior in Kullback-Leibler (KL) divergence, ByPE-VAE combines the expressive power of exemplar-based priors with significant computational efficiency and implicit regularization against overfitting. The approach demonstrates improvements across density estimation, unsupervised representation learning, and generative data augmentation tasks compared to traditional VAE variants and other advanced prior formulations (Ai et al., 2021).

1. Model Formulation and Theoretical Foundations

Given a sample $x$ from a dataset $X$ and latent variable $z \in \mathbb{R}^{d_z}$, the standard VAE maximizes the evidence lower bound (ELBO):

L(\theta,\phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \mathrm{KL}(q_\phi(z|x)\,\|\,p_0(z)),

where $p_0(z)$ is typically the standard normal prior.

Exemplar VAE [Norouzi et al. 2020] replaces the standard normal prior with a data-dependent mixture:

p_\phi(z|X) = \frac{1}{N}\sum_{n=1}^N r_\phi(z|x_n),

where each $r_\phi(z|x_n)$ is a Gaussian whose mean and variance are produced by a learned network.

ByPE-VAE introduces a further innovation: it replaces the sum over all data points (computationally prohibitive for large $N$) with a mixture over a learned, weighted pseudocoreset $U = \{u_m\}_{m=1}^M$ ($M \ll N$) with nonnegative weights $w_m$ ($\sum_m w_m = N$):

p_\phi(z|U,w) = \sum_{m=1}^M \frac{w_m}{N} r_\phi(z|u_m).

This pseudocoreset mixture prior is trained to minimize $\mathrm{KL}(p_\phi(z|U,w)\,\|\,p_\phi(z|X))$, i.e., to approximate the full data-dependent mixture.
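Because the mixture sums over only $M$ components, its log-density is cheap to evaluate with a log-sum-exp. The following is a minimal numpy sketch, assuming the diagonal-Gaussian means and log-variances produced by $r_\phi$ for each pseudopoint have already been computed; the array names are illustrative, not from the paper.

```python
import numpy as np

def log_mixture_prior(z, means, log_vars, w, N):
    """Log-density of the pseudocoreset mixture prior
    log p(z|U,w) = log sum_m (w_m / N) * N(z; mu_m, sigma_m^2),
    computed stably via log-sum-exp.

    z        : (d,)   latent point
    means    : (M, d) component means from r_phi(.|u_m) (assumed precomputed)
    log_vars : (M, d) diagonal log-variances
    w        : (M,)   nonnegative weights summing to N
    N        : full dataset size
    """
    # per-component diagonal Gaussian log-densities, shape (M,)
    log_comp = -0.5 * np.sum(
        log_vars + np.log(2 * np.pi) + (z - means) ** 2 / np.exp(log_vars),
        axis=1,
    )
    a = np.log(w) - np.log(N) + log_comp  # log of weighted component terms
    m = a.max()                           # log-sum-exp for numerical stability
    return m + np.log(np.sum(np.exp(a - m)))
```

With a single component of unit weight mass, this reduces to an ordinary Gaussian log-density, which is a convenient sanity check.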

The explicit ELBO optimized in ByPE-VAE is (dropping additive constants):

O(\theta,\phi,U,w;x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \mathbb{E}_{q_\phi(z|x)}\bigg[\log q_\phi(z|x) - \log\sum_{m=1}^M \frac{w_m}{N} r_\phi(z|u_m)\bigg].
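To make the objective concrete, here is a single-sample Monte Carlo estimate of $O$ in numpy using the reparameterization trick. This is a sketch under assumed interfaces: `decode_logp` stands in for $\log p_\theta(x|z)$, and the encoder and prior components are taken to be diagonal Gaussians; none of these names come from the paper's implementation.

```python
import numpy as np

def elbo_estimate(x, enc_mu, enc_logvar, decode_logp,
                  prior_means, prior_logvars, w, N, rng):
    """One-sample estimate of
    O = E_q[log p(x|z)] - E_q[log q(z|x) - log p(z|U,w)]."""
    # reparameterized sample z = mu + sigma * eps
    eps = rng.standard_normal(enc_mu.shape)
    z = enc_mu + np.exp(0.5 * enc_logvar) * eps
    # log q_phi(z|x) for a diagonal Gaussian encoder
    log_q = -0.5 * np.sum(enc_logvar + np.log(2 * np.pi) + eps ** 2)
    # log p_phi(z|U,w): log-sum-exp over the M pseudocoreset components
    log_comp = -0.5 * np.sum(
        prior_logvars + np.log(2 * np.pi)
        + (z - prior_means) ** 2 / np.exp(prior_logvars),
        axis=1,
    )
    a = np.log(w / N) + log_comp
    m = a.max()
    log_prior = m + np.log(np.sum(np.exp(a - m)))
    return decode_logp(x, z) - (log_q - log_prior)
```

When the encoder matches a single-component prior exactly, the KL part of the estimate vanishes, which makes the estimator easy to sanity-check.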

2. Pseudocoreset Construction and KL Minimization

The pseudocoreset $(U^*, w^*)$ is chosen to minimize the KL divergence between the pseudocoreset mixture prior and the (intractable) full-data mixture prior:

(U^*,w^*) = \arg\min_{U,\,w\geq 0:\,\sum_m w_m=N} \mathrm{KL}(p_\phi(z|U,w)\,\|\,p_\phi(z|X)).

The KL divergence admits a gradient representation via covariances under the coreset mixture $p_\phi(z|U,w)$:

  • Gradient w.r.t. pseudocoreset points $u_m$:

\nabla_{u_m}\,\mathrm{KL} = -w_m\,\mathrm{Cov}_{z\sim p_\phi(z|U,w)} \bigg[ \nabla_{u_m}\log p_\theta(u_m|z),\; \log p_\theta(X|z)^\top\mathbf{1}_N - \log p_\theta(U|z)^\top w \bigg]

  • Gradient w.r.t. weights $w$:

\nabla_w\,\mathrm{KL} = -\mathrm{Cov}_{z\sim p_\phi(z|U,w)}\left[ \log p_\theta(U|z),\; \log p_\theta(X|z)^\top\mathbf{1}_N - \log p_\theta(U|z)^\top w \right]

Specializing to Gaussian $r_\phi(\cdot|u)$, the gradients are efficiently computed by backpropagating through the network that produces the component means.

In practice, stochastic gradients are formed by sampling $z_s \sim p_\phi(z|U,w)$ and minibatches from $X$, enabling scalable KL minimization even for moderate $M$ and large $N$.
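The weight gradient above can be estimated from samples as the negative empirical covariance between the vector $\log p_\theta(U|z)$ and the scalar residual $\log p_\theta(X|z)^\top\mathbf{1}_N - \log p_\theta(U|z)^\top w$. A numpy sketch under assumed interfaces (`loglik` stands in for the decoder log-likelihood; `scale` rescales a minibatch sum to approximate the full-data term):

```python
import numpy as np

def weight_kl_grad(z_samples, loglik, X_batch, scale, U, w):
    """Stochastic estimate of grad_w KL(p(z|U,w) || p(z|X)) via the
    covariance identity, with z_samples drawn from the coreset mixture."""
    S, M = len(z_samples), len(U)
    F = np.empty((S, M))  # F[s, m] = log p_theta(u_m | z_s)
    g = np.empty(S)       # g[s] = data term minus weighted coreset term
    for s, z in enumerate(z_samples):
        F[s] = [loglik(u, z) for u in U]
        data_term = scale * sum(loglik(x, z) for x in X_batch)
        g[s] = data_term - F[s] @ w
    # center both factors and average their product: a sample covariance
    Fc = F - F.mean(axis=0)
    gc = g - g.mean()
    return -(Fc * gc[:, None]).mean(axis=0)  # one entry per weight w_m
```

A useful check: if the log-likelihood is constant in $z$, both factors are deterministic, the covariance vanishes, and the estimated gradient is exactly zero.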

3. Alternating Training Procedure

ByPE-VAE employs a two-step alternating optimization:

  1. VAE Parameter Step: With $(U,w)$ fixed, update encoder and decoder parameters $(\phi,\theta)$ by maximizing the ELBO above over minibatches.
  2. Pseudocoreset Step: Every $k$ epochs (empirically, $k=10$ suffices), update both the pseudocoreset points and weights using stochastic estimates of the KL gradients as described.

Initialization uniformly samples $M$ pseudopoints from $X$, with equal weights ($w_m = N/M$). Pseudocoreset updates are projected to maintain nonnegativity and preserve the total-weight constraint. Overall complexity is dominated by the VAE step (which is amortized for small $M$) and by the coreset step when $k$ is small.
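The two-step loop, including a simple projection that keeps the weights nonnegative with total mass $N$, can be sketched as follows. `vae_step` and `coreset_grads` are hypothetical callables standing in for the updates described above, and the learning rate is illustrative:

```python
import numpy as np

def project_weights(w, N):
    """Project weights back onto the feasible set: w >= 0, sum(w) = N."""
    w = np.clip(w, 0.0, None)
    return w * (N / w.sum())

def train(X, M, epochs, k, vae_step, coreset_grads, rng):
    """Skeleton of the alternating optimization. vae_step updates
    (theta, phi) on minibatches; coreset_grads returns stochastic KL
    gradients (gU, gw) for the pseudocoreset points and weights."""
    N = len(X)
    idx = rng.choice(N, size=M, replace=False)
    U = X[idx].copy()            # init: M points sampled uniformly from X
    w = np.full(M, N / M)        # equal weights, summing to N
    lr = 1e-2
    for epoch in range(epochs):
        vae_step(X, U, w)                    # step 1: VAE parameter update
        if (epoch + 1) % k == 0:             # step 2: every k epochs
            gU, gw = coreset_grads(U, w)
            U -= lr * gU
            w = project_weights(w - lr * gw, N)
    return U, w
```

The rescaling projection is one simple way to restore the constraint $\sum_m w_m = N$ after a gradient step; other projections onto the scaled simplex would also work.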

4. Experimental Evaluation and Results

ByPE-VAE is evaluated on density estimation, representation learning, and generative data augmentation over several standard datasets: Dynamic MNIST, Fashion MNIST, CIFAR-10, and CelebA. Architectures involve both MLP and CNN backbones, with latent dimension $d_z = 40$ and $M = 500$ (or $M = 240$ for CelebA).

Key empirical findings include:

  • Density Estimation: On Dynamic MNIST, Fashion MNIST, and CIFAR-10, ByPE-VAE achieves the best test negative log-likelihood (NLL), outperforming the VAE with Gaussian prior, VAE+VampPrior, and Exemplar VAE.
  • Training Efficiency: ByPE-VAE reduces training time per epoch by approximately $3\times$ relative to Exemplar VAE: 13.2 s vs. 35.5 s on Dynamic MNIST ($M=500$ vs. $N=25{,}000$ mixture components).
  • Latent Representations: t-SNE visualization of MNIST embeddings reveals tighter clusters and improved inter-class separation. kNN classification accuracy (on learned codes) is consistently higher under ByPE-VAE for all $k \in \{3,5,\dots,15\}$.
  • Data Augmentation: Discriminative test error on permutation-invariant MNIST with augmented samples: ByPE-VAE achieves $1.10\%$ (posterior sampling) and $0.88\%$ (prior sampling), vs. $1.16\%/1.10\%$ for Exemplar VAE.
  • Ablation Analyses: Performance remains stable as the pseudocoreset update interval $k$ increases up to 50; ByPE-VAE outperforms Exemplar variants at equal coreset size.

Empirical Comparisons

| Method | Dyn MNIST (NLL) | Fashion MNIST (NLL) | CIFAR-10 (NLL) | Time/Epoch, Dyn MNIST (s) |
|---|---|---|---|---|
| VAE + Gaussian prior | 24.41 | 21.43 | 72.21 | n/a |
| VAE + VampPrior | 23.65 | 20.87 | 71.97 | 13.0 |
| VAE + Exemplar | 23.83 | 21.00 | 72.55 | 35.5 |
| ByPE-VAE | 23.61 | 20.85 | 71.91 | 13.2 |

This demonstrates that ByPE-VAE matches or exceeds benchmarks on both data efficiency and sample quality (Ai et al., 2021).

5. Analysis of Computational and Statistical Properties

Computational Efficiency: ByPE-VAE's computational savings result from replacing a full sum over $N$ exemplars in the prior with a sum over $M \ll N$ learned pseudopoints. Because pseudocoreset updates are amortized (run only every $k$ epochs), the additional cost is negligible relative to the overall VAE training cycle.

Regularization and Overfitting Avoidance: Pseudocoreset construction acts as an implicit regularizer. Only a small set of points and their weights are used to assemble the mixture prior, mitigating risks of overfitting or memorization associated with full-dataset-dependent priors.

Optimization and Sensitivity: The KL-based pseudocoreset adaptation is sensitive to the hyperparameters $M$, $k$, and the number of Monte Carlo samples $S$, as well as to initialization; if $M$ is too small or the initialization is poor, the coreset may underfit the true data distribution.

6. Limitations and Possible Extensions

While ByPE-VAE achieves substantial improvements, limitations remain:

  • Coreset Update Overhead: The stochastic updates for pseudocoreset points and weights still incur nontrivial cost, scaling with the number of Monte Carlo samples ($S \approx 50$), the minibatch size ($B \approx 100$), and the coreset size ($M \approx 500$).
  • Hyperparameter Sensitivity: Careful selection of $M$, $k$, and the learning rates $\gamma_t$ is necessary to avoid degraded performance.
  • Potential Underfitting: If the coreset size $M$ is too small, the induced prior may not adequately match the full-data mixture, especially for complex data or highly multi-modal latent structures.

Proposed extensions and areas for future work include:

  • Amortized pseudocoresets: replacing static pseudopoints with a generative model of $u_m$.
  • Hierarchical pseudocoreset priors: employing clusters within clusters to further enhance prior expressiveness.
  • Application to advanced likelihood models: incorporating pseudocoresets into normalizing flow architectures.
  • Dynamic coreset adjustment: growing or pruning $M$ dynamically during training to maintain model flexibility.

A plausible implication is that pseudocoresets could be adapted for other Bayesian deep learning settings that benefit from expressive, computationally efficient data-dependent priors.

7. Context and Broader Impact in Variational Learning

ByPE-VAE demonstrates that judiciously optimizing a lightweight, learnable pseudocoreset can enable practical approximation of highly flexible, data-dependent mixture priors, expanding the toolkit available for expressive probabilistic modeling without incurring prohibitive costs. Its success highlights the tradeoff between model complexity, inference tractability, and regularization—a central concern for variational learning frameworks. The methodology is broadly applicable across density estimation, unsupervised embedding, and generative modeling, and serves as a foundation for ongoing research into scalable, highly expressive Bayesian generative models (Ai et al., 2021).
