Variational Deep Embedding (VaDE)

Updated 11 March 2026

VaDE is an unsupervised generative clustering framework that integrates variational autoencoders with a Gaussian mixture model for end-to-end learning.
It optimizes the evidence lower bound (ELBO) to learn latent representations and soft cluster assignments concurrently.
VaDE achieves strong performance on benchmarks like MNIST, significantly improving clustering accuracy over traditional methods.

Variational Deep Embedding (VaDE) is an unsupervised, generative clustering framework that combines the deep latent variable modeling capacity of the Variational Autoencoder (VAE) with explicit clustering enforced through a Gaussian Mixture Model (GMM) prior in the latent space. VaDE supports end-to-end optimization of both representation learning and cluster assignments within a unified probabilistic framework, enabling principled generative modeling, cluster discovery, and sample generation conditioned on cluster identity (Jiang et al., 2016).

1. Probabilistic Framework and Generative Model

VaDE specifies a joint generative process for a data vector $\mathbf{x} \in \mathbb{R}^D$ comprising three stages:

Latent Cluster Assignment: Draw cluster $c \in \{1,\ldots,K\}$ from a categorical prior $p(c) = \mathrm{Cat}(c|\boldsymbol{\pi})$, where $\sum_{k=1}^K \pi_k = 1$.
Cluster-Specific Latent Variable: Given $c$, sample latent code $\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}_c, \operatorname{diag}(\boldsymbol{\sigma}_c^2))$.
Data Generation: Generate $\mathbf{x}$ from a decoder network conditioned on $\mathbf{z}$, i.e., $p(\mathbf{x}|\mathbf{z}) = \mathrm{Decoder}_\theta(\mathbf{z})$, with the likelihood being either Gaussian or Bernoulli depending on the data.

The full generative model is characterized by the joint distribution $p(\mathbf{x}, \mathbf{z}, c) = p(c) \, p(\mathbf{z}|c) \, p(\mathbf{x}|\mathbf{z})$. This joint model forms the basis for both unsupervised clustering and data generation from specified clusters (Jiang et al., 2016, Luo et al., 2020).
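The three-stage process above can be sketched as ancestral sampling. This is a minimal illustration, not the authors' implementation: `toy_decoder` is a hypothetical stand-in for $\mathrm{Decoder}_\theta$ (a fixed map followed by a sigmoid), a Bernoulli likelihood is assumed, and all parameter values are made up.

```python
import math
import random

def sample_vade(pi, mu, sigma2, decoder, rng):
    """Ancestral sampling: c ~ Cat(pi), z ~ N(mu_c, diag(sigma2_c)), x ~ p(x|z)."""
    # Stage 1: draw the cluster index from the categorical prior.
    c = rng.choices(range(len(pi)), weights=pi, k=1)[0]
    # Stage 2: draw the latent code from the cluster's diagonal Gaussian.
    z = [rng.gauss(m, math.sqrt(v)) for m, v in zip(mu[c], sigma2[c])]
    # Stage 3: decode to Bernoulli means and sample a binary data vector.
    probs = decoder(z)
    x = [1 if rng.random() < p else 0 for p in probs]
    return c, z, x

def toy_decoder(z):
    """Hypothetical decoder: maps z to D = 4 Bernoulli means via a sigmoid."""
    return [1.0 / (1.0 + math.exp(-(sum(z) + d - 1.5))) for d in range(4)]

rng = random.Random(0)
pi = [0.5, 0.3, 0.2]                            # mixing weights, sum to 1
mu = [[-2.0, 0.0], [0.0, 2.0], [2.0, 0.0]]      # cluster means (J = 2)
sigma2 = [[0.5, 0.5], [0.2, 0.2], [1.0, 0.3]]   # per-dimension variances

c, z, x = sample_vade(pi, mu, sigma2, toy_decoder, rng)
print(c, z, x)
```

Fixing `c` instead of sampling it in stage 1 gives generation conditioned on a chosen cluster.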

2. Variational Inference and Optimization Objective

The true posterior $p(\mathbf{z}, c|\mathbf{x})$ is intractable. VaDE employs a mean-field variational approximation: $q(\mathbf{z}, c|\mathbf{x}) = q(\mathbf{z}|\mathbf{x}) \, q(c|\mathbf{x})$. The inference network comprises:

Encoder $q_\phi(\mathbf{z}|\mathbf{x})$: A neural network outputs $\tilde{\boldsymbol{\mu}}(\mathbf{x})$ and $\log \tilde{\boldsymbol{\sigma}}^2(\mathbf{x})$, parametrizing a diagonal Gaussian.
Posterior over clusters $q(c|\mathbf{x})$: Approximated by the GMM responsibility $p(c|\mathbf{z})$ averaged over latent samples $\mathbf{z} \sim q(\mathbf{z}|\mathbf{x})$, i.e., $q(c|\mathbf{x}) \approx \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}[p(c|\mathbf{z})]$.
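As a rough sketch of the cluster posterior described above, the following estimates $q(c|\mathbf{x})$ by averaging $p(c|\mathbf{z})$ over samples from the encoder's Gaussian. The samples are drawn as mean plus scaled standard normal noise (the usual reparameterized form for VAEs, assumed here). The encoder outputs and GMM parameters are illustrative values, and the function names are hypothetical.

```python
import math
import random

def log_normal_diag(z, mu, var):
    """Log density of a diagonal Gaussian N(z | mu, diag(var))."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (zj - m) ** 2 / v)
               for zj, m, v in zip(z, mu, var))

def responsibilities(z, pi, mu, sigma2):
    """p(c|z) by Bayes' rule, computed in log space for stability."""
    logits = [math.log(p) + log_normal_diag(z, m, v)
              for p, m, v in zip(pi, mu, sigma2)]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def q_c_given_x(enc_mu, enc_logvar, pi, mu, sigma2, n_samples, rng):
    """Monte Carlo estimate of E_{q(z|x)}[p(c|z)]."""
    acc = [0.0] * len(pi)
    for _ in range(n_samples):
        # Reparameterized draw: z = mu + sigma * eps, with eps ~ N(0, 1).
        z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
             for m, lv in zip(enc_mu, enc_logvar)]
        for k, g in enumerate(responsibilities(z, pi, mu, sigma2)):
            acc[k] += g / n_samples
    return acc

rng = random.Random(0)
q = q_c_given_x(enc_mu=[1.8, 0.1], enc_logvar=[-3.0, -3.0],
                pi=[0.5, 0.5], mu=[[-2.0, 0.0], [2.0, 0.0]],
                sigma2=[[1.0, 1.0], [1.0, 1.0]], n_samples=50, rng=rng)
print(q)
```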

The optimization target is the evidence lower bound (ELBO): $\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q(\mathbf{z}, c|\mathbf{x})} [\log p(\mathbf{x}|\mathbf{z})] - \mathrm{KL} \bigl( q(\mathbf{z}, c|\mathbf{x}) \parallel p(\mathbf{z}, c) \bigr)$. Because $q(\mathbf{z}, c|\mathbf{x})$ factorizes and $p(\mathbf{z}, c) = p(c)\,p(\mathbf{z}|c)$, this expands to: $\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\bigl[\log p(\mathbf{x}|\mathbf{z})\bigr] - \sum_{c=1}^K q(c|\mathbf{x}) \, \mathrm{KL} \left(q(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}|c)\right) - \mathrm{KL} \left( q(c|\mathbf{x}) \parallel p(c) \right)$. The ELBO incentivizes the encoder to produce latent codes that are well-clustered under the GMM prior while preserving high-quality reconstructions (Jiang et al., 2016, Macarie-Ancau et al., 23 Sep 2025).
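The three terms of the expanded ELBO can be evaluated for a single data point as in the sketch below. It assumes a Bernoulli likelihood, a single-sample estimate of the reconstruction expectation (the decoder output `recon_probs` is taken as given), and the standard closed-form KL divergence between diagonal Gaussians. All numeric inputs are illustrative.

```python
import math

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL(N(mu_q, diag var_q) || N(mu_p, diag var_p)) in closed form."""
    return 0.5 * sum(math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
                     for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p))

def kl_categorical(q, p):
    """KL(q || p) for two categorical distributions."""
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0.0)

def bernoulli_log_lik(x, probs, eps=1e-7):
    """log p(x|z) for binary x given decoder means probs."""
    return sum(xi * math.log(max(p, eps)) + (1 - xi) * math.log(max(1.0 - p, eps))
               for xi, p in zip(x, probs))

def elbo(x, recon_probs, enc_mu, enc_var, gamma, pi, mu, sigma2):
    """Reconstruction term minus the two KL terms of the expanded ELBO."""
    recon = bernoulli_log_lik(x, recon_probs)
    # sum_c q(c|x) * KL(q(z|x) || p(z|c))
    kl_z = sum(g * kl_diag_gauss(enc_mu, enc_var, m, v)
               for g, m, v in zip(gamma, mu, sigma2))
    # KL(q(c|x) || p(c))
    kl_c = kl_categorical(gamma, pi)
    return recon - kl_z - kl_c

x = [1, 0, 1]
recon_probs = [0.9, 0.2, 0.7]          # decoder output at one sampled z
value = elbo(x, recon_probs,
             enc_mu=[0.1, -0.2], enc_var=[0.5, 0.5],
             gamma=[0.8, 0.2], pi=[0.5, 0.5],
             mu=[[0.0, 0.0], [3.0, 3.0]], sigma2=[[1.0, 1.0], [1.0, 1.0]])
print(value)
```

Training maximizes this quantity (equivalently, minimizes its negative) averaged over a mini-batch.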

3. Network Architecture and Training Procedure

Typical implementations use fully connected neural networks for both encoder and decoder. The encoder $g(\cdot;\phi)$ and decoder $f(\cdot;\theta)$ generally use multiple hidden layers with ReLU activations. For example:

Encoder: $D \rightarrow 500 \rightarrow 500 \rightarrow 2000 \rightarrow J$ (outputting both mean and log-variance).
Decoder: $J \rightarrow 2000 \rightarrow 500 \rightarrow 500 \rightarrow D$ .

Latent dimension $J$ is problem-dependent (commonly 10 for MNIST-class datasets; in other applications it is typically set between 10 and 20). The number of mixture components $K$ matches the target cluster count.
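To give a sense of scale, the sketch below counts trainable parameters for the layer sizes listed above. It assumes flattened 28×28 MNIST inputs ($D = 784$), $J = 10$, $K = 10$, fully connected layers with biases, and two separate linear heads on the last encoder layer for the mean and log-variance. The two-head layout is an assumption, since the text only states that both quantities are output.

```python
def mlp_param_count(sizes):
    """Weights plus biases for a chain of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

D, J, K = 784, 10, 10

# Encoder trunk D -> 500 -> 500 -> 2000, then two heads of size J
# (mean and log-variance); the head layout is an assumption.
encoder_params = mlp_param_count([D, 500, 500, 2000]) + 2 * mlp_param_count([2000, J])

# Decoder J -> 2000 -> 500 -> 500 -> D.
decoder_params = mlp_param_count([J, 2000, 500, 500, D])

# GMM prior: K mixing weights, plus K means and K variance vectors of size J.
gmm_params = K + 2 * K * J

print(encoder_params, decoder_params, gmm_params)
```

Under these assumptions the GMM prior adds only a few hundred parameters on top of the autoencoder.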

Optimization employs the Adam optimizer, frequently with batch sizes of 100–1000 and learning rates in the $10^{-4}$–$10^{-3}$ range. To improve stability, the encoder and decoder may be pretrained as a standard autoencoder; a GMM fitted to the resulting encoder means $\tilde{\boldsymbol{\mu}}(\mathbf{x})$ is then used to initialize $\{\pi_c, \boldsymbol{\mu}_c, \boldsymbol{\sigma}_c^2\}$ (Jiang et al., 2016, Luo et al., 2020, Macarie-Ancau et al., 23 Sep 2025).
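The GMM initialization step can be illustrated with a small expectation-maximization loop for a diagonal-covariance mixture. This is a sketch under stated assumptions: `points` are made-up stand-ins for the pretrained encoder means, `init_means` is a hand-picked starting point, and a practical pipeline would more likely call an existing GMM implementation than this hand-written loop.

```python
import math

def log_normal_diag(z, mu, var):
    """Log density of a diagonal Gaussian N(z | mu, diag(var))."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (zj - m) ** 2 / v)
               for zj, m, v in zip(z, mu, var))

def fit_diag_gmm(points, init_means, n_iter=20, var_floor=1e-6):
    """EM for a diagonal GMM; returns (pi, mu, var) to initialize the prior."""
    K, J, N = len(init_means), len(points[0]), len(points)
    pi = [1.0 / K] * K
    mu = [list(m) for m in init_means]
    var = [[1.0] * J for _ in range(K)]
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each point.
        resp = []
        for z in points:
            logits = [math.log(pi[k]) + log_normal_diag(z, mu[k], var[k])
                      for k in range(K)]
            top = max(logits)
            w = [math.exp(l - top) for l in logits]
            total = sum(w)
            resp.append([wi / total for wi in w])
        # M-step: re-estimate weights, means, and variances.
        for k in range(K):
            nk = sum(r[k] for r in resp) + 1e-12  # guard against empty clusters
            pi[k] = nk / (N + K * 1e-12)
            mu[k] = [sum(r[k] * z[j] for r, z in zip(resp, points)) / nk
                     for j in range(J)]
            var[k] = [max(sum(r[k] * (z[j] - mu[k][j]) ** 2
                              for r, z in zip(resp, points)) / nk, var_floor)
                      for j in range(J)]
    return pi, mu, var

# Made-up encoder means forming two groups in a 2-D latent space.
points = [[-2.1, 0.1], [-1.9, -0.1], [-2.0, 0.2],
          [2.1, 0.0], [1.9, 0.1], [2.0, -0.2]]
pi, mu, var = fit_diag_gmm(points, init_means=[[-1.0, 0.0], [1.0, 0.0]])
print(pi, mu, var)
```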

In "Deep Clustering for Blood Cell Classification and Quantification" (Macarie-Ancau et al., 23 Sep 2025), the VaDE module re-uses the standard VAE architecture from the earlier pipeline stage, with the only substantive change being the use of a Gaussian mixture prior with $K=3$ for the RBC subtypes.

4. Inference and Cluster Assignment

At test time, cluster assignment for a point $\mathbf{x}$ uses the variational posterior $q(c|\mathbf{x})$. This is computed by first encoding $\mathbf{x}$ to $\tilde{\boldsymbol{\mu}}(\mathbf{x})$ and then evaluating the GMM responsibility at that point: $q(c|\mathbf{x}) \approx p(c|\mathbf{z}=\tilde{\boldsymbol{\mu}}(\mathbf{x})) = \frac{\pi_c \, \mathcal{N}(\tilde{\boldsymbol{\mu}}(\mathbf{x})|\boldsymbol{\mu}_c, \Sigma_c)} {\sum_{j=1}^K \pi_j \, \mathcal{N}(\tilde{\boldsymbol{\mu}}(\mathbf{x})|\boldsymbol{\mu}_j, \Sigma_j)}$, where $\Sigma_c = \operatorname{diag}(\boldsymbol{\sigma}_c^2)$. This "soft" cluster assignment provides probabilistic labels for downstream analysis or aggregation (Jiang et al., 2016, Macarie-Ancau et al., 23 Sep 2025). In applications lacking instance-level ground truth (such as cell-level labels), these scores serve as proxy cluster labels.
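A minimal sketch of this test-time rule follows: it evaluates the responsibilities at each encoder mean, takes the arg max for a hard label, and averages the soft assignments as one possible form of aggregation. The `encoded` values are made-up stand-ins for encoder means, and the helper names are hypothetical.

```python
import math

def log_normal_diag(z, mu, var):
    """Log density of a diagonal Gaussian N(z | mu, diag(var))."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (zj - m) ** 2 / v)
               for zj, m, v in zip(z, mu, var))

def soft_assign(enc_mu, pi, mu, sigma2):
    """p(c | z = enc_mu), using a max shift to avoid underflow."""
    logits = [math.log(p) + log_normal_diag(enc_mu, m, v)
              for p, m, v in zip(pi, mu, sigma2)]
    top = max(logits)
    w = [math.exp(l - top) for l in logits]
    total = sum(w)
    return [wi / total for wi in w]

pi = [0.6, 0.4]
mu = [[0.0, 0.0], [4.0, 4.0]]
sigma2 = [[1.0, 1.0], [1.0, 1.0]]
encoded = [[0.2, -0.1], [3.9, 4.2], [2.0, 2.0]]   # stand-ins for encoder means

gammas = [soft_assign(m, pi, mu, sigma2) for m in encoded]
# Hard label: the most probable cluster for each point.
hard = [max(range(len(pi)), key=g.__getitem__) for g in gammas]
# Aggregation: average soft assignment per cluster across the batch.
proportions = [sum(g[k] for g in gammas) / len(gammas) for k in range(len(pi))]
print(hard, proportions)
```

The third point is equidistant from both cluster means, so its responsibilities fall back to the mixing weights $\boldsymbol{\pi}$.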

5. Empirical Performance and Extensions

VaDE exhibits strong performance on standard clustering tasks. For example, on the MNIST dataset (10 clusters), VaDE achieves 94.46% unsupervised accuracy, substantially outperforming GMM (53.7%), AE+GMM (82.2%), and other deep clustering methods including DEC (84.3%) (Jiang et al., 2016). Comparable improvements are reported for text (Reuters-10K) and multimodal data.
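Unsupervised clustering accuracy is conventionally computed under the best one-to-one mapping between predicted cluster indices and ground-truth labels. The sketch below illustrates the metric on made-up labels. It searches all permutations by brute force, which is only practical for small $K$; evaluations at larger $K$ typically use the Hungarian algorithm instead.

```python
from itertools import permutations

def clustering_accuracy(true_labels, pred_labels, n_clusters):
    """Accuracy under the best one-to-one cluster-to-label mapping."""
    # counts[p][t] = number of points with predicted cluster p and true label t.
    counts = [[0] * n_clusters for _ in range(n_clusters)]
    for t, p in zip(true_labels, pred_labels):
        counts[p][t] += 1
    # Brute-force search over all mappings (a swap-in for Hungarian matching).
    best = max(sum(counts[p][perm[p]] for p in range(n_clusters))
               for perm in permutations(range(n_clusters)))
    return best / len(true_labels)

true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred = [2, 2, 2, 0, 0, 1, 1, 1, 1]   # cluster indices are arbitrary names
acc = clustering_accuracy(true, pred, 3)
print(acc)
```

Here the best mapping sends predicted cluster 2 to label 0, cluster 0 to label 1, and cluster 1 to label 2, matching 8 of 9 points.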

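As a consequence of its generative formulation, VaDE supports unconditional and cluster-conditional sample generation. Given cluster index $c$, one samples $\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}_c, \operatorname{diag}(\boldsymbol{\sigma}_c^2))$, then $\mathbf{x} \sim p(\mathbf{x}|\mathbf{z})$. For unconditional generation, $c$ is first drawn from the prior $\mathrm{Cat}(\boldsymbol{\pi})$.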

VaDE serves as a foundation for further refinement. The Locality Preserving Variational Discriminative Network (LPVDN) (Luo et al., 2020) introduces an adversarial robust embedding discriminator and a locality‐preserving constraint, jointly improving clustering quality and robustness to noise. On MNIST, LPVDN raises accuracy from 94.5% (VaDE) to 97.2%, and maintains high accuracy under substantial input noise, whereas VaDE degrades (Luo et al., 2020).

Model	MNIST ACC	Fashion-MNIST ACC	Reuters-10K ACC
VaDE	94.5%	63.8%	80.9%
LPVDN	97.2%	68.4%	83.0%

Performance metrics are as reported in Luo et al. (2020).

6. Applications and Case Studies

VaDE has been adapted for specialized domains with limited or noisy supervision. In "Deep Clustering for Blood Cell Classification and Quantification" (Macarie-Ancau et al., 23 Sep 2025), VaDE is used to cluster red blood cell (RBC) events into $K=3$ biologically informed subtypes: normal, reticulocyte, and clumped. The architecture is standard except for this domain-specific choice of cluster number. Due to the absence of cell-level ground truth, only qualitative assessment is available, but clear separation of subtypes is observed in the learned embedding.

In general, VaDE has demonstrated competitive or superior results in vision, text, biomedical, and sensor data, including benchmark datasets such as MNIST, Fashion-MNIST, Reuters, HHAR, and STL-10 (Jiang et al., 2016, Luo et al., 2020, Macarie-Ancau et al., 23 Sep 2025).

7. Limitations and Future Directions

Known limitations of the baseline VaDE are its sensitivity to input perturbations and insufficient incorporation of local geometric structure. LPVDN addresses these by integrating a robust mutual-information discriminator and a t-SNE–style neighborhood loss, resulting in improved robustness and fidelity of local structure in the learned representation (Luo et al., 2020). LPVDN, however, introduces additional complexity and scalability constraints due to pairwise affinity computation.

The VaDE framework generalizes beyond GMM priors: mixtures of Laplace or Student's t components, or even nonparametric Dirichlet process mixtures, are admissible, provided the revised KL-divergence terms remain tractable (Jiang et al., 2016). Open research directions include online or approximate neighborhood graphs, adaptive multi-scale loss weighting, and extensions to mixed data modalities (Luo et al., 2020). In domains with no cluster labels, expert curation or partial annotation will be needed for further quantitative assessment of clustering accuracy and subtype discovery (Macarie-Ancau et al., 23 Sep 2025).