Contrastive Variational Autoencoder (cVAE)
- Contrastive Variational Autoencoder (cVAE) is a generative modeling framework that utilizes contrastive mechanisms to isolate target-specific features from background variations.
- It separates the latent space into salient and shared components using an augmented ELBO, KL divergence penalties, and contrastive regularization to improve the learned representation.
- cVAE is applied in fields like biomedicine, speech processing, and sequential recommendation to improve model interpretability and performance.
A Contrastive Variational Autoencoder (cVAE) is a generative modeling framework that extends variational autoencoding by incorporating explicit contrastive mechanisms to isolate, disentangle, or enrich latent factors that are uniquely salient in a "target" dataset relative to a "background" dataset or to a set of negative examples. cVAE models provide a principled and scalable approach to contrastive representation learning within the probabilistic modeling paradigm, and have been applied in settings ranging from biomedicine to sequential recommendation and speech processing. The paradigm encompasses multiple architectural and training variants, united by the common goal of separating shared from salient sources of variation via contrastive objectives or explicit dataset splitting.
1. Motivations and Contrastive Principle
The cVAE paradigm is motivated by scientific and practical tasks where one wishes to extract variation present in a "target" dataset that is absent or diminished in a "background" or reference dataset, or simply disentangle distinct latent factors within observed data. Standard VAEs, while effective at reconstructing data and learning compressed representations, tend to model the dominant axes of variation in the marginal data distribution, potentially ignoring rare, task-salient, or group-specific factors. Contrastive learning reframes this challenge by leveraging a reference distribution (or an implicit notion of negative examples) to "subtract" away nuisance variation, forcing the model to allocate representational capacity to the target-relevant signals (Abid et al., 2019, Weinberger et al., 2022).
2. Formal cVAE Generative Models
At the core of classical cVAE formulations is an explicit separation of latent variables into components that are "shared" across both datasets and "salient" to the target dataset. The canonical generative model is as follows (Abid et al., 2019):
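- For target sample $x_i$: $s_i \sim \mathcal{N}(0, I)$, $z_i \sim \mathcal{N}(0, I)$, $x_i \sim p_\theta(x_i \mid s_i, z_i)$
- For background sample $b_j$: $z_j \sim \mathcal{N}(0, I)$, $b_j \sim p_\theta(b_j \mid 0, z_j)$
Here, $s$ denotes salient (target-specific) latent factors, and $z$ denotes shared latent factors. The decoder receives both $s$ and $z$ for target samples, but only $z$ (with $s$ fixed to zero) for background data, enforcing that background samples cannot utilize the salient subspace for reconstruction (Abid et al., 2019, Weinberger et al., 2022).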
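The generative process described above can be illustrated with a minimal sketch. It writes s for the salient latent and z for the shared latent, and it assumes a toy linear decoder with Gaussian observation noise; the function names, the weight matrices `W_s` and `W_z`, and `noise_std` are hypothetical choices made for illustration and are not taken from the cited papers.

```python
import random

def matvec(W, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def decode_mean(s, z, W_s, W_z):
    """Hypothetical linear decoder mean: W_s s + W_z z."""
    return [a + b for a, b in zip(matvec(W_s, s), matvec(W_z, z))]

def sample_target(rng, W_s, W_z, noise_std=0.1):
    """Target sample: both s and z are drawn from N(0, I)."""
    s = [rng.gauss(0.0, 1.0) for _ in W_s[0]]
    z = [rng.gauss(0.0, 1.0) for _ in W_z[0]]
    mean = decode_mean(s, z, W_s, W_z)
    x = [m + rng.gauss(0.0, noise_std) for m in mean]
    return x, s, z

def sample_background(rng, W_s, W_z, noise_std=0.1):
    """Background sample: s is clamped to zero, only z is drawn."""
    s = [0.0 for _ in W_s[0]]
    z = [rng.gauss(0.0, 1.0) for _ in W_z[0]]
    mean = decode_mean(s, z, W_s, W_z)
    b = [m + rng.gauss(0.0, noise_std) for m in mean]
    return b, s, z

if __name__ == "__main__":
    rng = random.Random(0)
    W_s = [[1.0], [0.0], [-1.0]]                 # 3 observed dims, 1 salient dim
    W_z = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5]]   # 3 observed dims, 2 shared dims
    x, s, z = sample_target(rng, W_s, W_z)
    b, s_zero, z_b = sample_background(rng, W_s, W_z)
    print("target sample:", x)
    print("background sample:", b)
```

Because the salient input is clamped to zero for background samples, the salient weights `W_s` have no effect on them; in this sketch, that clamp is what ties the salient subspace to target-only variation.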
Generalizations include structured Gaussian mixture priors, split-branch architectures for other modalities, and sequence settings with static/dynamic partitions in sequential data (Bai et al., 2021, Ebbers et al., 2020).
3. Learning Objectives and Contrastive Regularization
cVAE models optimize an augmented ELBO that includes components for both target and background samples:
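- For target:
$$\mathcal{L}_x(x_i) = \mathbb{E}_{q_{\phi_s}(s \mid x_i)\, q_{\phi_z}(z \mid x_i)}\left[\log p_\theta(x_i \mid s, z)\right] - \mathrm{KL}\left(q_{\phi_s}(s \mid x_i) \,\|\, p(s)\right) - \mathrm{KL}\left(q_{\phi_z}(z \mid x_i) \,\|\, p(z)\right)$$
- For background:
$$\mathcal{L}_b(b_j) = \mathbb{E}_{q_{\phi_z}(z \mid b_j)}\left[\log p_\theta(b_j \mid 0, z)\right] - \mathrm{KL}\left(q_{\phi_z}(z \mid b_j) \,\|\, p(z)\right)$$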
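As a rough illustration of the two per-sample objectives, the sketch below computes single-sample Monte Carlo estimates of a target ELBO and a background ELBO. It assumes diagonal Gaussian posteriors, standard normal priors, and a unit-variance Gaussian likelihood; the helper names and the `decoder(s, z)` callable are hypothetical, and real implementations would typically use an autodiff framework rather than plain Python.

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def gaussian_log_lik(x, mean, sigma=1.0):
    """Log density of x under N(mean, sigma^2 I) (assumed likelihood)."""
    const = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(const - 0.5 * ((xi - mi) / sigma) ** 2 for xi, mi in zip(x, mean))

def reparameterize(mu, log_var, rng):
    """Draw mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def target_elbo(x, q_s, q_z, decoder, rng):
    """Target term: reconstruct from (s, z), penalize both posteriors."""
    (mu_s, lv_s), (mu_z, lv_z) = q_s, q_z
    s = reparameterize(mu_s, lv_s, rng)
    z = reparameterize(mu_z, lv_z, rng)
    recon = gaussian_log_lik(x, decoder(s, z))
    return recon - kl_to_standard_normal(mu_s, lv_s) - kl_to_standard_normal(mu_z, lv_z)

def background_elbo(b, q_z, decoder, salient_dim, rng):
    """Background term: s is fixed to zero, so only z is sampled and penalized."""
    mu_z, lv_z = q_z
    z = reparameterize(mu_z, lv_z, rng)
    recon = gaussian_log_lik(b, decoder([0.0] * salient_dim, z))
    return recon - kl_to_standard_normal(mu_z, lv_z)

if __name__ == "__main__":
    rng = random.Random(0)
    decoder = lambda s, z: [s[0] + z[0], z[0] - s[0]]  # toy decoder mean
    q_s = ([0.1], [-1.0])   # (mean, log-variance) of q(s | x)
    q_z = ([0.2], [-1.0])   # (mean, log-variance) of q(z | x) or q(z | b)
    total = target_elbo([0.3, 0.1], q_s, q_z, decoder, rng) \
        + background_elbo([0.2, 0.2], q_z, decoder, 1, rng)
    print("summed ELBO estimate:", total)
```

In this sketch the background term carries no KL penalty on the salient posterior, since the salient latent is not inferred for background samples; whether a given variant adds further regularization there is not specified here.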
The ELBO is then summed over both datasets. Explicit independence between the salient latents $s$ and the shared latents $z$ is enforced via total correlation penalties (estimated by the density-ratio trick or MMD) (Weinberger et al., 2022, Louiset et al., 2023). More advanced variants