EdVAE: Mitigating Codebook Collapse with Evidential Discrete Variational Autoencoders (2310.05718v3)
Abstract: Codebook collapse is a common problem when training deep generative models with discrete representation spaces, such as Vector Quantized Variational Autoencoders (VQ-VAEs). We observe that the same problem arises for the alternatively designed discrete variational autoencoders (dVAEs), whose encoder directly learns a distribution over the codebook embeddings to represent the data. We hypothesize that using the softmax function to obtain this probability distribution causes codebook collapse by assigning overconfident probabilities to the best-matching codebook elements. In this paper, we propose a novel way to incorporate evidential deep learning (EDL) in place of softmax to combat the codebook collapse problem of dVAE. In contrast to softmax, our evidential formulation monitors the uncertainty of the probability distribution over the codebook embeddings. Our experiments on various datasets show that our model, called EdVAE, mitigates codebook collapse while improving reconstruction performance, and enhances codebook usage compared to dVAE- and VQ-VAE-based models. Our code can be found at https://github.com/ituvisionlab/EdVAE .
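The contrast the abstract draws can be sketched numerically. Below is a minimal illustration, not the paper's exact parameterization: softmax turns encoder logits directly into probabilities and can place nearly all mass on the best-matching codebook element, whereas a common EDL-style alternative maps logits to non-negative evidence, forms Dirichlet concentrations `alpha = evidence + 1`, and uses the Dirichlet mean as the distribution over the `K` codebook embeddings, which tempers overconfidence. The softplus evidence function is an assumed choice for illustration.

```python
import numpy as np

def softmax(logits):
    # Standard softmax over codebook logits: can assign near-one
    # probability to the single best-matching codebook element.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def edl_expected_probs(logits):
    # EDL-style alternative (a sketch, assuming softplus evidence):
    # non-negative evidence -> Dirichlet concentrations alpha = evidence + 1,
    # then the Dirichlet mean alpha / alpha.sum() is the distribution
    # over the K codebook embeddings.
    evidence = np.log1p(np.exp(logits))  # softplus keeps evidence >= 0
    alpha = evidence + 1.0
    return alpha / alpha.sum(axis=-1, keepdims=True)

# One codebook element dominates the logits.
logits = np.array([8.0, 1.0, 0.5, 0.2])
print(softmax(logits))            # heavily peaked on the first element
print(edl_expected_probs(logits)) # noticeably smoother distribution
```

Under this sketch the softmax output concentrates almost all probability on the first element, while the Dirichlet mean spreads mass more evenly, which is the kind of behavior the paper argues helps keep more codebook elements in use.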