Gaussian Mixture VQ: Probabilistic Vector Quantization

Updated 3 January 2026
  • Gaussian Mixture VQ (GM-VQ) is a probabilistic vector quantization framework that employs a Gaussian mixture prior to map continuous representations to discrete codebooks, improving generative modeling.
  • It introduces an aggregated categorical posterior ELBO (ALBO) that replaces per-sample entropy with the entropy of the aggregated posterior, ensuring effective codebook utilization.
  • GM-VQ optimizes the network end-to-end without separate commitment losses, demonstrating a significant reduction in reconstruction error and robust performance on benchmarks such as CIFAR-10 and CelebA.

Gaussian Mixture Vector Quantization (GM-VQ) is a probabilistic framework extending vector-quantized variational autoencoders (VQ-VAE) for mapping continuous representations to discrete codebooks, a building block central to generative modeling, information bottlenecks, and discrete tokenization in machine learning. GM-VQ introduces a Gaussian mixture as the generative prior, utilizes adaptive variances to capture complex data structure, and replaces the heuristic objectives of prior VQ-VAE work with a unified Bayesian optimization approach. The Aggregated Categorical Posterior Evidence Lower Bound (ALBO) enables improved codebook utilization and reduced reconstruction error without reliance on handcrafted regularization or codebook management heuristics (Yan et al., 2024).

1. Probabilistic Generative Model and Inference

GM-VQ defines a hierarchical latent variable model comprising a categorical latent code $c \in \{1, \dots, C\}$ with prior $p(c) = \pi_c$, typically uniform ($\pi_c = 1/C$). Conditional on $c$, a continuous latent $z \in \mathbb{R}^L$ is drawn from an isotropic Gaussian: $p(z|c) = \mathcal{N}(z; \mu_c, \sigma_z^2 I)$, where $\{\mu_c\}_{c=1}^C$ serves as the codebook of latent means and $\sigma_z^2$ is a fixed variance hyper-parameter (the limit $\sigma_z^2 \to 0$ recovers deterministic VQ). Observed data $x \in \mathbb{R}^D$ are generated by a decoder $D_\theta$ as $p(x|z) = \mathcal{N}(x; D_\theta(z), \sigma_x^2 I)$. The joint generative model factorizes as $p(x, z, c) = p(x|z)\, p(z|c)\, p(c)$.
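The generative side of the model amounts to ancestral sampling. The sketch below is a minimal PyTorch illustration of $p(x, z, c) = p(x|z)\,p(z|c)\,p(c)$; the codebook, decoder, dimensions, and variance values are placeholder assumptions rather than settings from the paper.

```python
import torch
import torch.nn as nn

# Placeholder sizes and hyper-parameters (illustrative only).
C, L, D = 512, 64, 3 * 32 * 32        # codebook size, latent dim, data dim
codebook = torch.randn(C, L)          # {mu_c}: codebook of latent means
sigma_z, sigma_x = 0.1, 1.0           # fixed latent / decoder std deviations
decoder = nn.Linear(L, D)             # stand-in for the learned decoder D_theta

def sample_generative(n: int):
    """Ancestral sampling from p(x, z, c) = p(x|z) p(z|c) p(c)."""
    c = torch.randint(0, C, (n,))                     # c ~ Uniform{1, ..., C}
    z = codebook[c] + sigma_z * torch.randn(n, L)     # z | c ~ N(mu_c, sigma_z^2 I)
    x = decoder(z) + sigma_x * torch.randn(n, D)      # x | z ~ N(D_theta(z), sigma_x^2 I)
    return x, z, c
```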

Inference is performed via a variational posterior $q(c, z|x) = q(c|x)\, q(z|c, x)$. The discrete component $q(c|x)$ is parameterized with categorical logits $\alpha_c(x)$ derived from a Mahalanobis-style distance between an encoder "proxy" $\hat{h}(x)$ and codebook entry $\mu_c$, scaled by an adaptive per-sample weight $w(x) = \operatorname{Softplus}(\tilde{w}(x))$:

$$\alpha_c(x) = -\tfrac{1}{2}\, (\hat{h}(x) - \mu_c)^{\top} \operatorname{diag}[w(x)]\, (\hat{h}(x) - \mu_c)$$

$$\pi_c(x) = \frac{\exp[\alpha_c(x)]}{\sum_{c'} \exp[\alpha_{c'}(x)]}$$

The continuous posterior $q(z|c, x) = \mathcal{N}(z; \mu_c, \Sigma_c(x))$ uses a diagonal covariance $\Sigma_c(x) = \sigma_c^2(x)\, I$, with adaptive variance

$$\sigma_c^2(x) = \frac{\| \hat{h}(x) - \mu_c \|^2}{2 L\, \sigma_z^2}$$

This yields a "soft assignment" of $x$ to each codeword, alongside adaptive uncertainty per assignment.
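These quantities are straightforward to compute in a batched fashion. The following sketch assumes the encoder emits $\hat{h}(x)$ and the raw weights $\tilde{w}(x)$ as tensors; the function and variable names are illustrative, not taken from a reference implementation.

```python
import torch
import torch.nn.functional as F

def posterior_params(h_hat, w_tilde, codebook, sigma_z):
    """Compute logits alpha_c(x), soft assignments pi_c(x), and adaptive
    variances sigma_c^2(x) for a batch of encoder outputs.

    h_hat:    (B, L) encoder proxy h_hat(x)
    w_tilde:  (B, L) raw per-sample weights (before Softplus)
    codebook: (C, L) latent means {mu_c}
    """
    w = F.softplus(w_tilde)                               # adaptive per-sample weight w(x) > 0
    diff = h_hat.unsqueeze(1) - codebook.unsqueeze(0)     # (B, C, L)
    # Mahalanobis-style logits: -1/2 (h - mu_c)^T diag[w] (h - mu_c)
    alpha = -0.5 * (diff.pow(2) * w.unsqueeze(1)).sum(-1)     # (B, C)
    pi = F.softmax(alpha, dim=-1)                         # soft assignments pi_c(x)
    L_dim = h_hat.shape[-1]
    sigma_c_sq = diff.pow(2).sum(-1) / (2 * L_dim * sigma_z**2)   # (B, C) adaptive variances
    return alpha, pi, sigma_c_sq
```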

2. Aggregated Categorical Posterior ELBO (ALBO)

GM-VQ introduces an alternative to the standard evidence lower bound (ELBO) by aggregating the categorical posterior over the data. The standard ELBO contributes a per-sample entropy term (appearing as $-H[q(c|x)]$ in the training loss), which incentivizes high-entropy assignments and conflicts with effective low-temperature Gumbel-Softmax sampling. GM-VQ replaces this per-example entropy with the entropy of the aggregated posterior $q(c) = \mathbb{E}_{x \sim p_{\text{data}}}[q(c|x)]$, yielding the ALBO objective:

$$E_{\text{ALBO}} = \mathbb{E}_{x \sim p_{\text{data}}}\, \mathbb{E}_{q(c|x)\, q(z|c, x)} \big[ \log p(x, z, c) - \log q(c) \big]$$

This objective aligns the variational posterior with the generative model and eliminates the adverse entropy component.
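In a mini-batch setting the aggregated posterior is naturally estimated by averaging $q(c|x)$ over the batch. The sketch below contrasts the per-sample entropy with the aggregated-posterior entropy and also computes the $KL[q(c) \| p(c)]$ term used in the practical loss; it is an illustrative Monte Carlo estimate under a uniform prior, not code from the paper.

```python
import math
import torch

def entropy_terms(pi: torch.Tensor):
    """pi: (B, C) soft assignments q(c|x) for one mini-batch."""
    eps = 1e-8
    q_c = pi.mean(dim=0)                                          # batch estimate of q(c) = E_x[q(c|x)]
    per_sample_entropy = -(pi * (pi + eps).log()).sum(-1).mean()  # E_x H[q(c|x)]  (standard ELBO term)
    aggregated_entropy = -(q_c * (q_c + eps).log()).sum()         # H[q(c)]        (ALBO term)
    # KL[q(c) || p(c)] with uniform p(c) = 1/C reduces to log C - H[q(c)]
    kl_to_uniform = math.log(pi.shape[-1]) - aggregated_entropy
    return per_sample_entropy, aggregated_entropy, kl_to_uniform
```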

3. Loss Function and Optimization

The practical GM-VQ loss is formulated as:

$$L_{\text{GM-VQ}} = \mathbb{E}_x\, \mathbb{E}_{c|x}\big[\, \| x - D_\theta(z(c, x)) \|^2 \,\big] + \gamma \left( \mathbb{E}_x\, \mathbb{E}_{c|x}\big[\, \| \bar{z}(c, x) - \mu_c \|^2 \,\big] + \beta\, \mathrm{KL}\big[ q(c) \,\|\, p(c) \big] \right)$$

Here $z(c, x) = \mu_c + \sigma_c(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, and $\gamma$, $\beta$ are positive hyper-parameters absorbing the fixed decoder and latent variances ($\sigma_x^2$, $\sigma_z^2$). This loss is optimized fully end-to-end without handcrafted commitment losses or codebook management steps.
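A per-batch estimate of this loss might look as follows. The helper is a hypothetical sketch: the exact form of $\bar{z}(c, x)$ is not pinned down by the summary above, so it is left as an input, and the default values of $\gamma$ and $\beta$ are placeholders.

```python
import math
import torch

def gm_vq_loss(x, x_hat, z_bar, mu_sel, pi, gamma=0.25, beta=1.0):
    """Monte Carlo estimate of L_GM-VQ for one mini-batch (illustrative sketch).

    x, x_hat : (B, D) data and reconstruction D_theta(z(c, x))
    z_bar    : (B, L) the quantity z_bar(c, x) in the latent alignment term
    mu_sel   : (B, L) codebook means mu_c gathered for the sampled codes
    pi       : (B, C) soft assignments q(c|x), used to estimate q(c)
    """
    recon = (x - x_hat).pow(2).sum(-1).mean()                 # E ||x - D_theta(z(c, x))||^2
    latent = (z_bar - mu_sel).pow(2).sum(-1).mean()           # E ||z_bar(c, x) - mu_c||^2
    q_c = pi.mean(dim=0)                                      # aggregated posterior q(c)
    kl = (q_c * (q_c + 1e-8).log()).sum() + math.log(pi.shape[-1])  # KL[q(c) || Uniform(C)]
    return recon + gamma * (latent + beta * kl)
```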

4. Training Procedure

The GM-VQ training pipeline for each mini-batch involves:

  • Encoding $x$ via the encoder to produce $(\hat{h}(x), w(x))$, computing the Mahalanobis-style logits $\alpha_c(x)$, and evaluating the soft assignments $\pi_c(x)$.
  • Sampling "soft" one-hot vectors $\tilde{c}$ with Gumbel-Softmax$(\pi(x), \tau)$, annealing $\tau$ from $2.0$ to $0.1$ over the course of training.
  • Drawing $z = \tilde{c}^{\top} M + \sigma_c(x) \odot \epsilon$, where $M$ is the codebook matrix of means, for the straight-through approximation.
  • Decoding to $\hat{x} = D_\theta(z)$.
  • Computing $L_{\text{GM-VQ}}$ and backpropagating gradients directly through the decoder parameters $\theta$, the encoder parameters $\phi$, and the codebook $M$ (a training-step sketch follows below).
  • Optimization is performed via AdamW with a cosine schedule and linear warm-up.

No separate exponential moving average, commitment loss, or post-hoc clustering is necessary.
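Putting the pieces together, one training step might be sketched as below, reusing the hypothetical `posterior_params` and `gm_vq_loss` helpers from earlier. It is a hedged illustration: in particular, it assumes the encoder proxy $\hat{h}(x)$ plays the role of $\bar{z}(c, x)$ in the latent alignment term, and leaves the temperature schedule to the caller.

```python
import torch
import torch.nn.functional as F

def training_step(x, encoder, decoder, codebook, sigma_z, tau, gamma=0.25, beta=1.0):
    """One GM-VQ training step with straight-through Gumbel-Softmax sampling.

    Assumes encoder(x) returns (h_hat, w_tilde); reuses the posterior_params
    and gm_vq_loss sketches above. All names and defaults are illustrative.
    """
    h_hat, w_tilde = encoder(x)
    alpha, pi, sigma_c_sq = posterior_params(h_hat, w_tilde, codebook, sigma_z)

    # Straight-through Gumbel-Softmax: hard one-hot forward, soft gradients backward.
    c_tilde = F.gumbel_softmax(alpha, tau=tau, hard=True)              # (B, C)

    mu_sel = c_tilde @ codebook                                        # tilde{c}^T M, (B, L)
    sigma_sel = (c_tilde * sigma_c_sq).sum(-1).clamp_min(1e-8).sqrt()  # selected sigma_c(x), (B,)
    z = mu_sel + sigma_sel.unsqueeze(-1) * torch.randn_like(mu_sel)    # reparameterized latent
    x_hat = decoder(z)

    # Assumption: use the encoder proxy h_hat as z_bar(c, x) in the alignment term.
    return gm_vq_loss(x, x_hat, h_hat, mu_sel, pi, gamma, beta)
```

In a full loop, $\tau$ would be annealed from $2.0$ to $0.1$ and the returned loss stepped with AdamW under a cosine schedule with linear warm-up, as described above.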

5. Empirical Evaluation

Experimental results on the CIFAR-10 and CelebA benchmarks demonstrate substantial improvements of GM-VQ over previous VQ-VAE approaches:

Model             | MSE (CIFAR-10 / CelebA) | Perplexity (CIFAR-10 / CelebA)
VQ-VAE            | 5.65 / 10.02            | 14.0 / 16.2
VQ-VAE + replace  | 4.07 / 4.77             | 109.8 / 676.4
GM-VQ             | 3.13 / 1.38             | 731.9 / 338.6
GM-VQ + Entropy   | 3.11 / 0.97             | 878.7 / 831.0

GM-VQ reduces reconstruction error relative to vanilla VQ-VAE by roughly 45% on CIFAR-10 and over 85% on CelebA, and attains high codebook perplexity, indicating that nearly the full codebook is in use (perplexity approaching the codebook size $C$). Increasing the entropy regularization weight $\beta$ yields higher perplexity and slightly lower MSE, indicating stronger codebook utilization.
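Codebook perplexity is conventionally computed as the exponentiated entropy of the empirical code-usage distribution; a value near the codebook size means codes are used nearly uniformly. A minimal way to estimate it from a batch of soft assignments (an illustrative helper, not the paper's evaluation code) is:

```python
import torch

def codebook_perplexity(pi: torch.Tensor) -> torch.Tensor:
    """pi: (B, C) soft assignments q(c|x); returns exp(H[q(c)]) estimated from the batch."""
    q_c = pi.mean(dim=0)                          # empirical code-usage distribution
    entropy = -(q_c * (q_c + 1e-8).log()).sum()   # H[q(c)]
    return entropy.exp()                          # values near C indicate full codebook use
```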

6. Analysis, Extensions, and Limitations

By adopting a Gaussian mixture prior over latent codes and maintaining learnable, small conditional variances, GM-VQ generalizes deterministic quantization to a fully probabilistic model in which codebook entries, mixing weights, and variances are optimized under a unified objective. The aggregated categorical posterior mitigates the conflict that arises in Gumbel-Softmax gradient estimation when high-entropy code assignments are encouraged; the harmful $-H[q(c|x)]$ term is removed from the training objective. GM-VQ eliminates the need for the post-hoc codebook replacement heuristics, commitment losses, and cluster management found in previous VQ-VAE frameworks, and codebook collapse is naturally prevented.

Potential extensions include end-to-end learning of $\sigma_z$ and the mixture weights $\pi_c$, as well as architectural hierarchies formed by stacked GM-VQ layers. Key limitations remain: the decoder variance $\sigma_x^2$ is fixed, effectiveness depends on the accuracy of the Gumbel-Softmax straight-through approximation, and further bias reduction may be achievable via control variates. A plausible implication is that the principled Bayesian framework enables broadening of quantized representations without sacrificing tractable training or differentiation (Yan et al., 2024).

