HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes (2401.00365v2)

Published 31 Dec 2023 in cs.LG, cs.AI, and cs.CV

Abstract: Vector quantization (VQ) is a technique for deterministically learning features with discrete codebook representations. It is commonly performed with a variational autoencoding model, VQ-VAE, which can be further extended to hierarchical structures for high-fidelity reconstruction. However, such hierarchical extensions of VQ-VAE often suffer from codebook/layer collapse, in which the codebook is not used efficiently to express the data, degrading reconstruction accuracy. To mitigate this problem, we propose the hierarchically quantized variational autoencoder (HQ-VAE), a novel unified framework that stochastically learns hierarchical discrete representations on the basis of variational Bayes. HQ-VAE naturally generalizes the hierarchical variants of VQ-VAE, such as VQ-VAE-2 and the residual-quantized VAE (RQ-VAE), and provides them with a Bayesian training scheme. Comprehensive experiments on image datasets show that HQ-VAE enhances codebook usage and improves reconstruction performance. We also validate HQ-VAE on an audio dataset, demonstrating its applicability to another modality.
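To make the quantization schemes the abstract refers to concrete, here is a minimal NumPy sketch of (a) the deterministic nearest-neighbor quantization used in VQ-VAE, (b) the residual, multi-layer quantization used in RQ-VAE, and (c) a softmax-based stochastic assignment that conveys, in spirit only, the stochastic quantization underlying HQ-VAE's variational-Bayes training. This is an illustrative sketch, not the authors' implementation: the codebooks are random, and `stochastic_quantize` is a generic temperature-controlled categorical sampler, not the paper's exact objective.

```python
import numpy as np

rng = np.random.default_rng(0)


def vector_quantize(z, codebook):
    """Deterministic VQ (as in VQ-VAE): map each feature vector to its
    nearest codebook entry under squared Euclidean distance.

    z:        (n, d) continuous encoder features
    codebook: (K, d) code vectors
    """
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (n, K)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx


def residual_quantize(z, codebooks):
    """Residual quantization (as in RQ-VAE): each layer quantizes the
    residual left by the previous layers, refining the approximation."""
    residual = z
    z_hat = np.zeros_like(z)
    indices = []
    for cb in codebooks:
        q, idx = vector_quantize(residual, cb)
        z_hat += q                # coarse-to-fine accumulation
        residual = residual - q   # unexplained part goes to the next layer
        indices.append(idx)
    return z_hat, indices


def stochastic_quantize(z, codebook, temperature=1.0):
    """Illustrative stochastic assignment: sample code indices from a
    categorical whose logits are negative squared distances. HQ-VAE's
    actual objective is derived from variational Bayes; this sampler
    only illustrates the deterministic-vs-stochastic contrast."""
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    logits = -dists / temperature
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    idx = np.array([rng.choice(len(codebook), p=p) for p in probs])
    return codebook[idx], idx


# Toy usage: 8 features of dimension 4, a stack of three 16-entry codebooks.
z = rng.normal(size=(8, 4))
codebooks = [rng.normal(size=(16, 4)) for _ in range(3)]
z_hat, _ = residual_quantize(z, codebooks)
print("residual-quantization error:", np.linalg.norm(z - z_hat))
```

In this toy picture, the codebook collapse described in the abstract corresponds to some rows of `codebook` never being selected by `argmin`; a stochastic assignment spreads probability mass over codes during training, which is the intuition behind the improved codebook usage the paper reports.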

Authors (11)
  1. Yuhta Takida
  2. Yukara Ikemiya
  3. Takashi Shibuya
  4. Kazuki Shimada
  5. Woosung Choi
  6. Chieh-Hsin Lai
  7. Naoki Murata
  8. Toshimitsu Uesaka
  9. Kengo Uchida
  10. Wei-Hsiang Liao
  11. Yuki Mitsufuji