Decomposition-based Quantized VAE

Updated 18 August 2025
  • The paper introduces DQ-VAE to decompose the latent space into meaningful components and quantize them independently, achieving exponential capacity with linear codebook growth.
  • DQ-VAE employs strategies like feature slicing, semantic segmentation, and scalar factorization to enhance disentanglement, interpretability, and controlled information bottlenecks.
  • Empirical results show that DQ-VAE improves reconstruction, compression, and representation quality across domains such as computer vision, graphs, and speech.

A Decomposition-based Quantized Variational AutoEncoder (DQ-VAE) is a class of deep generative models that extends traditional variational autoencoders by introducing explicit decomposition of the latent space and applying quantization at the level of individual latent components or factorized feature axes. This approach is motivated by the need to capture richer, more interpretable, and higher-capacity discrete latent representations, and to provide improved control over information bottlenecks and disentanglement in both generative modeling and downstream tasks.

1. Core Principles of DQ-VAE

DQ-VAE builds upon the foundation of Vector-Quantized Variational Autoencoders (VQ-VAE) (Oord et al., 2017), which use vector quantization to enforce discrete latent representations by mapping the encoder output to the nearest learned codebook entry. The key innovation in DQ-VAE is to further decompose the latent space into statistically or semantically meaningful components and quantize these independently, typically with separate codebooks per factor, feature slice, or semantic region. The decomposition can be along the feature axis (depthwise), spatial segments, semantic parts (as in articulated objects), or scalar factorization (with per-latent scalar quantization) (Fostiropoulos, 2020, Fostiropoulos et al., 2022, Zhao et al., 19 Jul 2024, Baykal et al., 23 Sep 2024).

The resulting bottleneck is thus “decomposed” and “quantized,” yielding a compositional, factorized discrete representation in which the model’s expressiveness increases exponentially with the number of factors, but the codebook parameter count grows only linearly (Fostiropoulos, 2020, Fostiropoulos et al., 2022). Explicit decomposition also supports further objectives:

  • disentanglement of latent factors,
  • improved codebook utilization,
  • enhanced interpretability, and
  • controlled information regularization.
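
To make the capacity claim above concrete, the following back-of-the-envelope Python snippet (an illustration, not code from the cited papers) compares the number of distinct composite codes with the number of codebook parameters as the number of slices L grows, assuming L independent codebooks of K entries with dimension d each:

```python
# Composite capacity vs. codebook parameter count for a decomposed bottleneck.
# Assumed setup: L independent codebooks, each with K entries of dimension d.
K, d = 512, 64

for L in (1, 2, 4, 8):
    capacity = K ** L   # distinct composite discrete codes: exponential in L
    params = L * K * d  # total codebook parameters: linear in L
    print(f"L={L}: {capacity:.2e} composite codes, {params:,} codebook parameters")
```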

2. Architectural Strategies and Loss Functions

There are several principal decomposition strategies:

  • Feature-Axis or Channel Decomposition: The latent tensor zₑ ∈ ℝ^{D×w×h} is split into L disjoint slices along its feature dimension. Each slice zₙ is quantized with its own independent codebook Cₙ (Fostiropoulos, 2020, Fostiropoulos et al., 2022).
  • Semantic or Structural Segmentation: In domain-specific tasks, such as human grasp synthesis, the input (e.g., a hand mesh) is split into semantic parts (e.g., fingers and palm), each encoded and quantized independently using part-specific codebooks (Zhao et al., 19 Jul 2024).
  • Scalar Latent Decomposition: Each element of a vectorized latent representation is quantized individually by mapping it to a scalar value from a shared global codebook (scalar quantization) (Baykal et al., 23 Sep 2024).

Quantization for each subspace or slice typically uses vector quantization:

$$z_{q,i} = e_{k^*}, \quad k^* = \arg\min_j \| z_i - e_j \|_2$$

where $z_i$ is the i-th latent component/slice and $e_j$ are the corresponding codebook entries.
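
The PyTorch sketch below (an illustrative implementation of the feature-axis decomposition described above, not the reference code of the cited papers; the class name `DepthwiseQuantizer` is hypothetical) applies this nearest-neighbor rule independently to each feature slice with one codebook per slice, and uses the usual straight-through estimator to pass gradients through the quantization step:

```python
import torch
import torch.nn as nn

class DepthwiseQuantizer(nn.Module):
    """Quantize L disjoint feature slices with L independent codebooks (sketch)."""

    def __init__(self, num_slices: int, slice_dim: int, codebook_size: int):
        super().__init__()
        self.num_slices = num_slices
        self.slice_dim = slice_dim
        # One codebook C_n of shape (K, slice_dim) per slice.
        self.codebooks = nn.ParameterList(
            [nn.Parameter(torch.randn(codebook_size, slice_dim) * 0.1)
             for _ in range(num_slices)]
        )

    def forward(self, z_e: torch.Tensor) -> torch.Tensor:
        # z_e: (batch, L * slice_dim, h, w); flatten spatial positions first.
        b, c, h, w = z_e.shape
        z = z_e.permute(0, 2, 3, 1).reshape(-1, self.num_slices, self.slice_dim)
        quantized = []
        for n, codebook in enumerate(self.codebooks):
            z_n = z[:, n]                       # (N, slice_dim)
            dists = torch.cdist(z_n, codebook)  # (N, K) pairwise L2 distances
            idx = dists.argmin(dim=1)           # nearest codebook entry per position
            quantized.append(codebook[idx])
        z_q = torch.stack(quantized, dim=1).reshape(b, h, w, c).permute(0, 3, 1, 2)
        # Straight-through estimator: gradients flow to z_e as if quantization were identity.
        return z_e + (z_q - z_e).detach()
```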

The loss functions for DQ-VAE models are a direct extension of the VQ-VAE loss:

$$\mathcal{L} = - \log p_\phi(x \mid z_q) + \sum_{i=1}^L \| \mathrm{sg}(z_i) - z_{q,i} \|_2^2 + \beta \sum_{i=1}^L \| z_i - \mathrm{sg}(z_{q,i}) \|_2^2$$

where $\mathrm{sg}$ denotes the stop-gradient operation and $\beta$ is the commitment weight (Fostiropoulos, 2020). In advanced formulations, additional terms account for entropy regularization, total correlation penalties for disentanglement (Baykal et al., 23 Sep 2024), or skeletal/physical constraints in pose modeling (Zhao et al., 19 Jul 2024).
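
A minimal sketch of this objective, assuming the per-slice encoder outputs and their quantized counterparts are available as lists of tensors (the variable names `z_slices` and `z_q_slices` are hypothetical), uses `detach()` in place of the stop-gradient operator sg:

```python
import torch
import torch.nn.functional as F

def dq_vae_loss(x, x_recon, z_slices, z_q_slices, beta: float = 0.25):
    """VQ-VAE-style loss summed over decomposed latent slices (illustrative sketch)."""
    # Reconstruction term: -log p(x | z_q); a Gaussian likelihood reduces to MSE up to constants.
    recon = F.mse_loss(x_recon, x)

    codebook_loss = 0.0
    commitment_loss = 0.0
    for z_i, z_q_i in zip(z_slices, z_q_slices):
        # || sg(z_i) - z_{q,i} ||^2 pulls codebook entries toward encoder outputs.
        codebook_loss = codebook_loss + F.mse_loss(z_q_i, z_i.detach())
        # || z_i - sg(z_{q,i}) ||^2 commits the encoder to its chosen codes.
        commitment_loss = commitment_loss + F.mse_loss(z_i, z_q_i.detach())

    return recon + codebook_loss + beta * commitment_loss
```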

Hierarchical and soft quantization strategies are also prominent:

  • Hierarchical Quantization: Latent codes are structured into multiple levels (e.g., coarse-to-fine), with separate quantization and priors at each level (Duan et al., 2022, Fostiropoulos et al., 2022, Zeng et al., 17 Apr 2025).
  • Soft or Stochastic Quantization: Rather than hard nearest-neighbor assignments, a Bayesian or probabilistic mechanism yields soft code assignments for each latent, with the possibility of self-annealed stochastic-to-deterministic transitions during training (Wu et al., 2019, Takida et al., 2022).
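
As a generic illustration of the soft-assignment idea (a sketch, not the exact Bayesian mechanism of the cited works), code probabilities can be computed as a softmax over negative squared distances with a temperature that is annealed toward zero during training, which recovers hard nearest-neighbor quantization in the limit:

```python
import torch

def soft_quantize(z: torch.Tensor, codebook: torch.Tensor, temperature: float):
    """Soft code assignment: expected codeword under a distance-based softmax (sketch).

    z:        (N, d) latent vectors
    codebook: (K, d) code embeddings
    As temperature -> 0, the assignment approaches hard nearest-neighbor quantization.
    """
    sq_dists = torch.cdist(z, codebook).pow(2)             # (N, K)
    probs = torch.softmax(-sq_dists / temperature, dim=1)  # soft code posteriors
    z_soft = probs @ codebook                              # convex combination of codes
    return z_soft, probs
```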

3. Information-Theoretic and Regularization Perspectives

The DQ-VAE objective can be interpreted through the lens of the variational information bottleneck (VIB) and entropy decomposition frameworks (Wu et al., 2018, Lygerakis et al., 9 Jul 2024). The key constituents are:

  • Reconstruction Term: Enforces preservation of input data through the quantized latent.
  • Rate/Regularization Term(s):
    • Entropy and cross-entropy between the learned latent distribution and (possibly non-trivial) priors (Lygerakis et al., 9 Jul 2024), which can be tailored independently for each decomposed latent slice.
    • Mutual information or total correlation penalties encouraging independence and disentanglement across latent dimensions (Baykal et al., 23 Sep 2024).
    • Codebook usage regularization (e.g., maximizing codebook perplexity or penalizing imbalance).

For instance, the Entropy Decomposed VAE (ED-VAE) generalizes the ELBO as

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q(z|x)}[\log p_\theta(x|z)] - I_q(x, z) + H[q(z)] - H[q(z), p(z)]$$

where $H[q(z)]$ is the latent entropy and $H[q(z), p(z)]$ is the cross-entropy with the prior, affording precise control over information content and regularization per component (Lygerakis et al., 9 Jul 2024).

In DQ-VAE, this perspective translates into explicit, component-wise balancing of reconstruction fidelity, codebook entropy, and alignment with (potentially structured) priors.

4. Training Techniques and Codebook Management

Robust training of DQ-VAE models—particularly with large or multiple codebooks—is nontrivial (Łańcucki et al., 2020, Zeng et al., 17 Apr 2025). Strategies empirically shown to improve codebook utilization and representation quality include:

  • Learning Rate Scheduling: Increasing the learning rate for codebook vectors facilitates adaptation to the (often rapidly changing) encoder output distribution (Łańcucki et al., 2020).
  • Batch Normalization: Normalizing encoder outputs prior to quantization keeps their scale consistent with the codebook vectors, improving angular similarity between encodings and codes and increasing codebook usage.
  • Data-Dependent Reinitialization: Periodically resetting codebooks based on recent encoder activations (e.g., k-means++ on reservoir-sampled activations) avoids dead codes and adapts to nonstationary dynamics (Łańcucki et al., 2020).
  • Annealing-Based Code Selection: Applying softmax or probabilistic code selection with a decaying temperature parameter encourages broad code exploration early in training and refinement toward optimal codes later (Zeng et al., 17 Apr 2025).
  • Hierarchical Codebooks: Stacking codebooks in multiple layers (e.g., a “code for the codes” second layer) helps encode relationships among codewords and addresses sparsity in codebook space (Zeng et al., 17 Apr 2025).
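
As an illustration of the data-dependent reinitialization idea, the simplified sketch below tracks per-code usage and resets unused ("dead") codes to recently observed encoder outputs; the cited approach seeds replacements with k-means++ on reservoir-sampled activations rather than the uniform resampling used here, and the function name is hypothetical:

```python
import torch

@torch.no_grad()
def reinit_dead_codes(codebook: torch.Tensor, usage_counts: torch.Tensor,
                      recent_encoder_outputs: torch.Tensor, min_usage: int = 1):
    """Replace rarely used codes with recent encoder activations (simplified sketch).

    codebook:               (K, d) code embeddings, updated in place
    usage_counts:           (K,) selection counts since the last reset
    recent_encoder_outputs: (N, d) buffer of recent pre-quantization latents
    """
    dead = (usage_counts < min_usage).nonzero(as_tuple=True)[0]
    if dead.numel() == 0:
        return
    # Uniformly resample replacements from the recent-activation buffer
    # (k-means++ seeding on a reservoir sample is used in the cited work).
    idx = torch.randint(0, recent_encoder_outputs.shape[0], (dead.numel(),))
    codebook[dead] = recent_encoder_outputs[idx]
    usage_counts.zero_()
```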

Proper management of these elements is crucial for avoiding common problems such as codebook underutilization or collapse, which can otherwise limit representational power and degrade generative performance.

5. Empirical Results and Applications

DQ-VAE models have achieved strong empirical results across a diverse set of domains, including:

  • Computer Vision: On CIFAR-10, decomposed quantization (depthwise/DQ) yields up to 33% improved reconstruction over single-codebook VQ-VAE, with bits/dim scores close to state-of-the-art autoregressive methods (Fostiropoulos, 2020, Fostiropoulos et al., 2022).
  • Image Compression: Hierarchical quantized VAEs can achieve superior rate-distortion trade-offs and outperform JPEG over varying bitrates, while supporting fast, parallel encoding/decoding (Yang et al., 2020, Duan et al., 2022).
  • Structured Data and Graphs: Hierarchical vector quantized graph autoencoders surpass 16 established baselines in link prediction and node classification by effectively capturing topology with discrete, compositionally structured codes (Zeng et al., 17 Apr 2025).
  • Disentangled Representation Learning: Scalar quantized DQ-VAE models regularized by total correlation dominate both DCI and InfoMEC disentanglement metrics, and provide robust, interpretable decompositions of factors in image datasets (Baykal et al., 23 Sep 2024).
  • Human Grasp Synthesis: Decomposing latent space by hand regions and employing dual-stage decoding with skeletal constraints enhances grasp realism, diversity, and physical plausibility, achieving a 14.1% quality index improvement over previous methods (Zhao et al., 19 Jul 2024).
  • Speech and Audio: DQ-VAE principles transfer to temporal domains, enabling robust unsupervised phoneme discovery and generative modeling (Oord et al., 2017, Łańcucki et al., 2020).

These results are typically measured using dataset-specific metrics: reconstruction loss, bits/dim, Fréchet Inception Distance (FID), cluster entropy, codebook perplexity, and domain-targeted metrics (e.g., grasp contact ratio and quality index, link prediction AUC).
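
Among these, codebook perplexity is simple to monitor during training; a minimal sketch (a hypothetical helper computed from one batch of code assignments) is:

```python
import torch

def codebook_perplexity(code_indices: torch.Tensor, codebook_size: int) -> float:
    """exp(entropy) of the empirical code-usage distribution; higher means better utilization."""
    counts = torch.bincount(code_indices.flatten(), minlength=codebook_size).float()
    probs = counts / counts.sum()
    entropy = -(probs * torch.log(probs + 1e-10)).sum()
    return entropy.exp().item()
```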

6. Limitations, Challenges, and Future Directions

Despite clear advantages, DQ-VAE models face several open challenges:

  • Codebook Management: As the number of decomposed factors or feature dimensions grows, ensuring efficient codebook utilization and avoiding code collapse remain critical (Zeng et al., 17 Apr 2025, Takida et al., 2022).
  • Assumption of Independence: Decomposition often assumes statistical independence across slices/parts, which may be violated in strongly correlated domains (e.g., spatial correlations in images), potentially limiting performance (Fostiropoulos, 2020).
  • Disentanglement in Complex Data: Achieving full, robust disentanglement in real-world or highly structured datasets (e.g., MPI3D) is an open problem, even for advanced DQ-VAE architectures (Baykal et al., 23 Sep 2024).
  • Integration with Autoregressive Priors: Combining powerful autoregressive priors and decomposed quantization requires careful modeling to avoid interference and maximize generative capacity (Fostiropoulos, 2020, Duan et al., 2022).
  • Efficient Regularization: Properly regularizing each decomposed latent channel in a way that balances diversity, information content, and codebook alignment is a focus of ongoing theoretical and empirical research (Wu et al., 2018, Lygerakis et al., 9 Jul 2024).

Further promising directions include extending hierarchical or structured codebooks beyond two layers, adaptive quantization grid refinement informed by posterior uncertainty (Yang et al., 2020), and integration with emerging generative modeling frameworks leveraging optimal transport or entropy decomposition principles (Vuong et al., 2023, Lygerakis et al., 9 Jul 2024).

7. Summary Table: Decomposition Strategies and Key Outcomes

| Decomposition Strategy | Codebook Allocation | Reported Benefits |
|---|---|---|
| Feature/Depthwise Slicing (Fostiropoulos, 2020, Fostiropoulos et al., 2022) | Per feature slice/channel | Exponential representational capacity, improved bits/dim, better likelihood |
| Semantic/Part-Based (Zhao et al., 19 Jul 2024) | Per semantic region (e.g., part) | Higher realism/diversity in structured outputs, faster inference |
| Scalar Factorization (Baykal et al., 23 Sep 2024) | Global scalar codebook | Stronger disentanglement, improved DCI/InfoMEC, interpretable latent space |
| Hierarchical/Coarse-to-Fine (Duan et al., 2022, Zeng et al., 17 Apr 2025) | Per level/layer | Adaptive detail allocation, better compression, improved graph structure capture |

This table catalogues representative DQ-VAE architectures, highlighting their core method of decomposition, mode of codebook assignment, and salient empirical advantages as established in the literature.
