
ByPE-VAE: Bayesian Pseudocoresets in VAEs

Updated 25 February 2026
  • The paper introduces ByPE-VAE, which replaces a full-data mixture prior with a learned pseudocoreset prior to reduce computational cost while maintaining high performance.
  • ByPE-VAE employs an alternating training procedure that dynamically updates both VAE parameters and pseudocoreset elements using efficient gradient-based KL minimization.
  • Empirical results on datasets like MNIST, CIFAR-10, and CelebA demonstrate that ByPE-VAE improves density estimation, latent clustering, and training efficiency compared to traditional VAE variants.

ByPE-VAE (Bayesian Pseudocoresets Exemplar Variational Autoencoder) is a deep generative modeling framework that extends the standard variational autoencoder (VAE) architecture by introducing a data-dependent mixture prior based on a small, learned pseudocoreset, rather than the full dataset. By dynamically optimizing both the pseudocoreset elements and their mixture weights to closely match the ideal full-data-based mixture prior in Kullback-Leibler (KL) divergence, ByPE-VAE combines the expressive power of exemplar-based priors with significant computational efficiency and implicit regularization against overfitting. The approach demonstrates improvements across density estimation, unsupervised representation learning, and generative data augmentation tasks compared to traditional VAE variants and other advanced prior formulations (Ai et al., 2021).

1. Model Formulation and Theoretical Foundations

Given a sample $x$ from a dataset $X$ and latent variable $z \in \mathbb{R}^{d_z}$, the standard VAE maximizes the evidence lower bound (ELBO):

L(\theta,\phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \mathrm{KL}(q_\phi(z|x)\,\|\,p_0(z)),

where $p_0(z)$ is typically the standard normal prior.

Exemplar VAE [Norouzi et al. 2020] replaces the standard normal prior with a data-dependent mixture:

p_\phi(z|X) = \frac{1}{N}\sum_{n=1}^N r_\phi(z|x_n),

where each $r_\phi(z|x_n)$ is a Gaussian whose mean and variance are produced by a learned network.

ByPE-VAE introduces a further innovation: it replaces the sum over all data points (computationally prohibitive for large $N$) with a mixture over a learned, weighted pseudocoreset $U = \{u_m\}_{m=1}^M$ ($M \ll N$) with nonnegative weights $w_m$ ($\sum_m w_m = N$):

p_\phi(z|U,w) = \sum_{m=1}^M \frac{w_m}{N} r_\phi(z|u_m).

This pseudocoreset mixture prior is trained to minimize $\mathrm{KL}(p_\phi(z|U,w)\,\|\,p_\phi(z|X))$, i.e., to approximate the full data-dependent mixture.
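Because the mixture sums over only $M$ components, its log-density is cheap to evaluate with a log-sum-exp. The following is a minimal numpy sketch, assuming the diagonal-Gaussian means and log-variances produced by $r_\phi$ for each pseudopoint have already been computed; the array names are illustrative, not from the paper.

```python
import numpy as np

def log_mixture_prior(z, means, log_vars, w, N):
    """Log-density of the pseudocoreset mixture prior
    log p(z|U,w) = log sum_m (w_m / N) * N(z; mu_m, sigma_m^2),
    computed stably via log-sum-exp.

    z        : (d,)   latent point
    means    : (M, d) component means from r_phi(.|u_m) (assumed precomputed)
    log_vars : (M, d) diagonal log-variances
    w        : (M,)   nonnegative weights summing to N
    N        : full dataset size
    """
    # per-component diagonal Gaussian log-densities, shape (M,)
    log_comp = -0.5 * np.sum(
        log_vars + np.log(2 * np.pi) + (z - means) ** 2 / np.exp(log_vars),
        axis=1,
    )
    a = np.log(w) - np.log(N) + log_comp  # log of weighted component terms
    m = a.max()                           # log-sum-exp for numerical stability
    return m + np.log(np.sum(np.exp(a - m)))
```

With a single component of unit weight mass, this reduces to an ordinary Gaussian log-density, which is a convenient sanity check.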

The explicit ELBO optimized in ByPE-VAE is (dropping additive constants):

O(\theta,\phi,U,w;x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \mathbb{E}_{q_\phi(z|x)}\bigg[\log q_\phi(z|x) - \log\sum_{m=1}^M \frac{w_m}{N} r_\phi(z|u_m)\bigg].
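To make the objective concrete, here is a single-sample Monte Carlo estimate of $O$ in numpy using the reparameterization trick. This is a sketch under assumed interfaces: `decode_logp` stands in for $\log p_\theta(x|z)$, and the encoder and prior components are taken to be diagonal Gaussians; none of these names come from the paper's implementation.

```python
import numpy as np

def elbo_estimate(x, enc_mu, enc_logvar, decode_logp,
                  prior_means, prior_logvars, w, N, rng):
    """One-sample estimate of
    O = E_q[log p(x|z)] - E_q[log q(z|x) - log p(z|U,w)]."""
    # reparameterized sample z = mu + sigma * eps
    eps = rng.standard_normal(enc_mu.shape)
    z = enc_mu + np.exp(0.5 * enc_logvar) * eps
    # log q_phi(z|x) for a diagonal Gaussian encoder
    log_q = -0.5 * np.sum(enc_logvar + np.log(2 * np.pi) + eps ** 2)
    # log p_phi(z|U,w): log-sum-exp over the M pseudocoreset components
    log_comp = -0.5 * np.sum(
        prior_logvars + np.log(2 * np.pi)
        + (z - prior_means) ** 2 / np.exp(prior_logvars),
        axis=1,
    )
    a = np.log(w / N) + log_comp
    m = a.max()
    log_prior = m + np.log(np.sum(np.exp(a - m)))
    return decode_logp(x, z) - (log_q - log_prior)
```

When the encoder matches a single-component prior exactly, the KL part of the estimate vanishes, which makes the estimator easy to sanity-check.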

2. Pseudocoreset Construction and KL Minimization

The pseudocoreset $(U^*, w^*)$ is chosen to minimize the KL divergence between the pseudocoreset mixture prior and the (intractable) full-data mixture prior:

(U^*,w^*) = \arg\min_{U,\,w\geq 0:\,\sum_m w_m=N} \mathrm{KL}(p_\phi(z|U,w)\,\|\,p_\phi(z|X)).

The KL divergence admits a gradient representation via covariances under the coreset mixture $p_\phi(z|U,w)$:

  • Gradient w.r.t. pseudocoreset points $u_m$:

\nabla_{u_m}\,\mathrm{KL} = -w_m\,\mathrm{Cov}_{z\sim p_\phi(z|U,w)} \bigg[ \nabla_{u_m}\log p_\theta(u_m|z),\; \log p_\theta(X|z)^\top\mathbf{1}_N - \log p_\theta(U|z)^\top w \bigg]

  • Gradient w.r.t. weights $w$:

\nabla_w\,\mathrm{KL} = -\mathrm{Cov}_{z\sim p_\phi(z|U,w)}\left[ \log p_\theta(U|z),\; \log p_\theta(X|z)^\top\mathbf{1}_N - \log p_\theta(U|z)^\top w \right]

Specializing to Gaussian $r_\phi(\cdot|u)$, the gradients are efficiently computed by backpropagating through the network that produces the component means.

In practice, stochastic gradients are formed by sampling $z_s \sim p_\phi(z|U,w)$ and minibatches from $X$, enabling scalable KL minimization even for moderate $M$ and large $N$.
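The weight gradient above can be estimated from samples as the negative empirical covariance between the vector $\log p_\theta(U|z)$ and the scalar residual $\log p_\theta(X|z)^\top\mathbf{1}_N - \log p_\theta(U|z)^\top w$. A numpy sketch under assumed interfaces (`loglik` stands in for the decoder log-likelihood; `scale` rescales a minibatch sum to approximate the full-data term):

```python
import numpy as np

def weight_kl_grad(z_samples, loglik, X_batch, scale, U, w):
    """Stochastic estimate of grad_w KL(p(z|U,w) || p(z|X)) via the
    covariance identity, with z_samples drawn from the coreset mixture."""
    S, M = len(z_samples), len(U)
    F = np.empty((S, M))  # F[s, m] = log p_theta(u_m | z_s)
    g = np.empty(S)       # g[s] = data term minus weighted coreset term
    for s, z in enumerate(z_samples):
        F[s] = [loglik(u, z) for u in U]
        data_term = scale * sum(loglik(x, z) for x in X_batch)
        g[s] = data_term - F[s] @ w
    # center both factors and average their product: a sample covariance
    Fc = F - F.mean(axis=0)
    gc = g - g.mean()
    return -(Fc * gc[:, None]).mean(axis=0)  # one entry per weight w_m
```

A useful check: if the log-likelihood is constant in $z$, both factors are deterministic, the covariance vanishes, and the estimated gradient is exactly zero.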

3. Alternating Training Procedure

ByPE-VAE employs a two-step alternating optimization:

  1. VAE Parameter Step: With $(U,w)$ fixed, update encoder and decoder parameters $(\phi,\theta)$ by maximizing the ELBO above over minibatches.
  2. Pseudocoreset Step: Every $k$ epochs (empirically, $k=10$ suffices), update both the pseudocoreset points and weights using stochastic estimates of the KL gradients as described.

Initialization uniformly samples $M$ pseudopoints from $X$, with equal weights ($w_m = N/M$). Pseudocoreset updates are projected to maintain nonnegativity and preserve the total-weight constraint. Overall complexity is dominated by the VAE step (which is amortized for small $M$) and by the coreset step when $k$ is small.
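The two-step loop, including a simple projection that keeps the weights nonnegative with total mass $N$, can be sketched as follows. `vae_step` and `coreset_grads` are hypothetical callables standing in for the updates described above, and the learning rate is illustrative:

```python
import numpy as np

def project_weights(w, N):
    """Project weights back onto the feasible set: w >= 0, sum(w) = N."""
    w = np.clip(w, 0.0, None)
    return w * (N / w.sum())

def train(X, M, epochs, k, vae_step, coreset_grads, rng):
    """Skeleton of the alternating optimization. vae_step updates
    (theta, phi) on minibatches; coreset_grads returns stochastic KL
    gradients (gU, gw) for the pseudocoreset points and weights."""
    N = len(X)
    idx = rng.choice(N, size=M, replace=False)
    U = X[idx].copy()            # init: M points sampled uniformly from X
    w = np.full(M, N / M)        # equal weights, summing to N
    lr = 1e-2
    for epoch in range(epochs):
        vae_step(X, U, w)                    # step 1: VAE parameter update
        if (epoch + 1) % k == 0:             # step 2: every k epochs
            gU, gw = coreset_grads(U, w)
            U -= lr * gU
            w = project_weights(w - lr * gw, N)
    return U, w
```

The rescaling projection is one simple way to restore the constraint $\sum_m w_m = N$ after a gradient step; other projections onto the scaled simplex would also work.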

4. Experimental Evaluation and Results

ByPE-VAE is evaluated on density estimation, representation learning, and generative data augmentation over several standard datasets: Dynamic MNIST, Fashion MNIST, CIFAR-10, and CelebA. Architectures involve both MLP and CNN backbones, with latent dimension $d_z = 40$ and $M = 500$ (or $M = 240$ for CelebA).

Key empirical findings include:

  • Density Estimation: On Dynamic MNIST, Fashion MNIST, and CIFAR-10, ByPE-VAE achieves the best test negative log-likelihood (NLL), outperforming the VAE with Gaussian prior, VAE+VampPrior, and Exemplar VAE.
  • Training Efficiency: ByPE-VAE reduces training time per epoch by approximately $3\times$ relative to Exemplar VAE: 13.2 s vs. 35.5 s on Dynamic MNIST ($M=500$ vs. $N=25{,}000$ mixture components).
  • Latent Representations: t-SNE visualization of MNIST embeddings reveals tighter clusters and improved inter-class separation. kNN classification accuracy (on learned codes) is consistently higher under ByPE-VAE for all $k \in \{3,5,\dots,15\}$.
  • Data Augmentation: Discriminative test error on permutation-invariant MNIST with augmented samples: ByPE-VAE achieves $1.10\%$ (posterior sampling) and $0.88\%$ (prior sampling), vs. $1.16\%/1.10\%$ for Exemplar VAE.
  • Ablation Analyses: Performance remains stable as the pseudocoreset update interval $k$ increases up to 50; ByPE-VAE outperforms Exemplar variants at equal coreset size.

Empirical Comparisons

| Method | Dyn MNIST (NLL) | Fashion MNIST (NLL) | CIFAR-10 (NLL) | Time/Epoch, Dyn MNIST (s) |
|---|---|---|---|---|
| VAE + Gaussian prior | 24.41 | 21.43 | 72.21 | n/a |
| VAE + VampPrior | 23.65 | 20.87 | 71.97 | 13.0 |
| VAE + Exemplar | 23.83 | 21.00 | 72.55 | 35.5 |
| ByPE-VAE | 23.61 | 20.85 | 71.91 | 13.2 |

This demonstrates that ByPE-VAE matches or exceeds benchmarks on both data efficiency and sample quality (Ai et al., 2021).

5. Analysis of Computational and Statistical Properties

Computational Efficiency: ByPE-VAE's computational savings result from replacing a full sum over $N$ exemplars in the prior with a sum over $M \ll N$ learned pseudopoints. Because pseudocoreset updates are amortized (run only every $k$ epochs), the additional cost is negligible relative to the overall VAE training cycle.

Regularization and Overfitting Avoidance: Pseudocoreset construction acts as an implicit regularizer. Only a small set of points and their weights are used to assemble the mixture prior, mitigating risks of overfitting or memorization associated with full-dataset-dependent priors.

Optimization and Sensitivity: The KL-based pseudocoreset adaptation is sensitive to the hyperparameters $M$, $k$, and the number of Monte Carlo samples $S$, as well as to initialization; if $M$ is too small or the initialization is poor, the coreset may underfit the true data distribution.

6. Limitations and Possible Extensions

While ByPE-VAE achieves substantial improvements, limitations remain:

  • Coreset Update Overhead: The stochastic updates for pseudocoreset points and weights still incur nontrivial cost, scaling with the number of Monte Carlo samples ($S \approx 50$), the minibatch size ($B \approx 100$), and the coreset size ($M \approx 500$).
  • Hyperparameter Sensitivity: Careful selection of $M$, $k$, and the learning rates $\gamma_t$ is necessary to avoid degraded performance.
  • Potential Underfitting: If the coreset size $M$ is too small, the induced prior may not adequately match the full-data mixture, especially for complex data or highly multi-modal latent structures.

Proposed extensions and areas for future work include:

  • Amortized pseudocoresets: replacing static pseudopoints with a generative model of $u_m$.
  • Hierarchical pseudocoreset priors: employing clusters within clusters to further enhance prior expressiveness.
  • Application to advanced likelihood models: incorporating pseudocoresets into normalizing flow architectures.
  • Dynamic coreset adjustment: growing or pruning $M$ dynamically during training to maintain model flexibility.

A plausible implication is that pseudocoresets could be adapted for other Bayesian deep learning settings that benefit from expressive, computationally efficient data-dependent priors.

7. Context and Broader Impact in Variational Learning

ByPE-VAE demonstrates that judiciously optimizing a lightweight, learnable pseudocoreset can enable practical approximation of highly flexible, data-dependent mixture priors, expanding the toolkit available for expressive probabilistic modeling without incurring prohibitive costs. Its success highlights the tradeoff between model complexity, inference tractability, and regularization—a central concern for variational learning frameworks. The methodology is broadly applicable across density estimation, unsupervised embedding, and generative modeling, and serves as a foundation for ongoing research into scalable, highly expressive Bayesian generative models (Ai et al., 2021).
