
SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound (2405.00233v2)

Published 30 Apr 2024 in cs.SD, cs.AI, cs.MM, eess.AS, and eess.SP

Abstract: LLMs have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modelling techniques to audio data. However, traditional codecs often operate at high bitrates or within narrow domains such as speech and lack the semantic clues required for efficient language modelling. Addressing these challenges, we introduce SemantiCodec, a novel codec designed to compress audio into fewer than a hundred tokens per second across diverse audio types, including speech, general sound, and music, without compromising quality. SemantiCodec features a dual-encoder architecture: a semantic encoder using a self-supervised pre-trained Audio Masked Autoencoder (AudioMAE), discretized using k-means clustering on extensive audio data, and an acoustic encoder to capture the remaining details. The semantic and acoustic encoder outputs are used to reconstruct audio via a diffusion-model-based decoder. SemantiCodec is presented in three variants with token rates of 25, 50, and 100 per second, supporting a range of ultra-low bit rates between 0.31 kbps and 1.40 kbps. Experimental results demonstrate that SemantiCodec significantly outperforms the state-of-the-art Descript codec on reconstruction quality. Our results also suggest that SemantiCodec contains significantly richer semantic information than all evaluated state-of-the-art audio codecs, even at significantly lower bitrates. Our code and demos are available at https://haoheliu.github.io/SemantiCodec/.

Exploring SemantiCodec: An Innovative Approach to Ultra Low Bitrate Audio Compression

Introduction to Semantic Audio Codecs

Audio codecs encode and decode digital audio, making it efficient to store and transmit for telecommunication and broadcasting. Traditional codecs compress data primarily by discarding inaudible components of the sound, whereas recent neural codecs use learned representations to improve both quality and compression rates. Most notably, these codecs rely on techniques such as vector quantization, which converts audio into discrete tokens, much as words are tokenized in NLP.
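
To make the tokenization idea concrete, here is a minimal, illustrative vector-quantization sketch (the codebook, dimensions, and data are placeholders, not SemantiCodec's actual components): each feature frame is mapped to the index of its nearest codebook entry, and those integer indices serve as the discrete audio tokens.

```python
import numpy as np

# Toy vector quantization: map each feature frame to the index of its
# nearest codebook entry, yielding one discrete token per frame.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 128))   # 1024 codewords, 128-dim features (illustrative)
features = rng.normal(size=(50, 128))     # e.g. 50 frames of encoder output

# Squared Euclidean distance from every frame to every codeword
dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = dists.argmin(axis=1)             # shape (50,), integers in [0, 1024)

print(tokens[:10])                        # discrete "audio tokens" for the first 10 frames
```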

However, when encoding varied audio types (speech, music, or ambient sound), balancing compression (low bitrate) against audio quality becomes increasingly difficult. This is where SemantiCodec, a novel semantic audio codec, makes its mark, achieving strong compression at ultra-low bitrates without sacrificing quality.

Core Innovations of SemantiCodec

Dual-Encoder Structure: SemantiCodec uses a unique dual-encoder system comprising a semantic encoder and an acoustic encoder. This architecture allows it to effectively compress audio while retaining crucial sound details.

  1. Semantic Encoder: It builds on AudioMAE, a self-supervised Audio Masked Autoencoder that learns from audio without explicit labels. The encoder extracts meaningful features, which are then discretized with k-means clustering into a compact representation referred to as semantic tokens (a minimal sketch of this step follows the list).
  2. Acoustic Encoder: This component captures the finer acoustic details that the semantic encoder misses, which is essential for restoring the audio to high quality during decoding.
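
Below is a minimal sketch of the semantic-token step, assuming AudioMAE features have already been extracted into an array. The codebook size, feature dimension, and use of scikit-learn's KMeans are illustrative stand-ins, not the paper's exact configuration or scale (the real codebooks are substantially larger and trained on far more data).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for AudioMAE features gathered over a large corpus: (n_frames, feature_dim)
audiomae_features = rng.normal(size=(10_000, 768))

# Offline step: learn a k-means codebook over the corpus features.
kmeans = KMeans(n_clusters=512, n_init=1, random_state=0).fit(audiomae_features)

# Online step: encode a new clip by assigning each frame to its nearest centroid.
new_clip_features = rng.normal(size=(250, 768))
semantic_tokens = kmeans.predict(new_clip_features)   # one integer token per frame
print(semantic_tokens[:20])
```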

Token Efficiency: Classic codecs often require high token rates (hundreds of tokens per second), which can hamper computational efficiency. SemantiCodec, however, manages to lower the token rate drastically, to as few as 25 tokens per second, significantly easing the computational load without degrading the audio output.
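
The link between token rate and bitrate is simple arithmetic: bitrate ≈ tokens per second × bits per token. Working backwards from the endpoints quoted in the abstract (and assuming the 25-token/s variant corresponds to the low end of the bitrate range and the 100-token/s variant to the high end) gives a rough bits-per-token estimate:

```python
# Bitrate (kbps) = tokens_per_second * bits_per_token / 1000.
# Endpoints of the range quoted in the abstract (assumed mapping to the
# lowest- and highest-rate variants):
for tokens_per_sec, kbps in [(25, 0.31), (100, 1.40)]:
    bits_per_token = kbps * 1000 / tokens_per_sec
    print(f"{tokens_per_sec} tokens/s at {kbps} kbps -> ~{bits_per_token:.1f} bits per token")
# ~12.4 and ~14.0 bits per token, i.e. codebooks on the order of
# a few thousand to a few tens of thousands of entries.
```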

Diffusion Model-Based Decoder: For reconstructing audio from the encoded tokens, SemantiCodec uses advanced generative models known as diffusion models, acclaimed for their ability to generate high-quality outputs. By conditioning on both semantic and acoustic tokens, the system ensures the reconstructed audio remains both accurate and semantically rich.
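
Conceptually, decoding follows the standard conditional diffusion recipe: start from noise and iteratively denoise while conditioning on the token embeddings. The toy sketch below illustrates that loop with a placeholder denoiser and made-up shapes; it is not SemantiCodec's actual network, noise schedule, or latent space.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoiser(x_t, t, cond):
    """Placeholder for the trained conditional denoising network.
    In SemantiCodec this would predict the noise in the latent given the
    semantic and acoustic token embeddings; here it is only a stand-in."""
    return 0.1 * x_t + 0.01 * cond  # illustrative only

# Token conditioning: embeddings looked up from the semantic/acoustic codebooks.
semantic_emb = rng.normal(size=(250, 64))
acoustic_emb = rng.normal(size=(250, 64))
cond = np.concatenate([semantic_emb, acoustic_emb], axis=-1)   # (frames, 128)

# Simple linear noise schedule and a DDPM-style reverse loop over a latent.
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

x = rng.normal(size=(250, 128))            # start from pure noise
for t in reversed(range(T)):
    eps_hat = denoiser(x, t, cond)         # predicted noise, conditioned on the tokens
    x = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t > 0:
        x += np.sqrt(betas[t]) * rng.normal(size=x.shape)

# 'x' would then be mapped back to a waveform (e.g. via a latent decoder and vocoder).
```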

Empirical Evaluation and Results

SemantiCodec is thoroughly evaluated against existing codecs like the Descript codec under various metrics:

  • Semantic Richness: SemantiCodec retains significantly more semantic information than the evaluated codecs, even at lower bitrates, which matters for language-model-based audio applications and downstream audio understanding tasks.
  • Reconstruction Quality: Its semantically rich tokens enable high-quality audio reconstruction even at low bitrates, surpassing state-of-the-art codecs, particularly below 1.5 kbps.

The tests confirm that even at ultra-low bitrates (as low as 0.31 kbps), SemantiCodec delivers audio quality that is competitive with, if not superior to, that of systems operating at much higher bitrates.
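
As a rough illustration of what "semantic richness" means operationally (not the paper's exact evaluation protocol), one can probe how much class-relevant information survives in the discrete tokens by training a lightweight classifier on a simple token representation, for example a per-clip token histogram:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical probe: represent each clip by a normalized histogram of its
# discrete tokens and see how well a linear classifier predicts clip labels.
rng = np.random.default_rng(0)
vocab_size, n_clips = 512, 1000
token_seqs = rng.integers(0, vocab_size, size=(n_clips, 250))   # stand-in codec tokens
labels = rng.integers(0, 10, size=n_clips)                       # stand-in class labels

histograms = np.zeros((n_clips, vocab_size))
for i, seq in enumerate(token_seqs):
    counts = np.bincount(seq, minlength=vocab_size)
    histograms[i] = counts / counts.sum()

X_train, X_test, y_train, y_test = train_test_split(histograms, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))   # near chance here, since the data is random
```

With real codec tokens, higher probe accuracy at the same bitrate would indicate that the tokens carry more semantic information.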

The Path Forward

While SemantiCodec introduces promising advancements in audio processing, future work could explore even deeper integration of semantic information. Improving the efficiency of encoding and decoding, for example by reducing the number of diffusion sampling steps or through other model optimizations, could enable real-time use in more bandwidth-sensitive environments.

Moreover, incorporating multi-modal learning, where the system could learn from not only audio but related modalities like video or text, could pave the way for more robust and versatile semantic audio codecs.

Conclusion

SemantiCodec has made significant strides in demonstrating that it's indeed possible to retain high audio quality at remarkably low bitrates with rich semantic understanding. This codec not only stands to benefit the traditional domains of telecommunications and broadcasting but also opens new avenues in smart devices, streaming services, and AI-powered audio applications, where efficiency and quality are paramount.

Authors (6)
  1. Haohe Liu (59 papers)
  2. Xuenan Xu (29 papers)
  3. Yi Yuan (54 papers)
  4. Mengyue Wu (57 papers)
  5. Wenwu Wang (148 papers)
  6. Mark D. Plumbley (114 papers)
Citations (12)