Codec-SUPERB: An In-Depth Analysis of Sound Codec Models (2402.13071v3)

Published 20 Feb 2024 in eess.AS and cs.SD

Abstract: The sound codec's dual roles in minimizing data transmission latency and serving as a tokenizer underscore its critical importance. Recent years have witnessed significant developments in codec models. The ideal sound codec should preserve content, paralinguistics, speaker identity, and audio information. However, the question of which codec achieves optimal sound information preservation remains unanswered, because models in different papers are evaluated under their own selected experimental settings. This study introduces Codec-SUPERB, an acronym for Codec sound processing Universal PERformance Benchmark. It is an ecosystem designed to assess codec models across representative sound applications and signal-level metrics rooted in sound domain knowledge. Codec-SUPERB simplifies result sharing through an online leaderboard, promoting collaboration within a community-driven benchmark database and thereby stimulating new development cycles for codecs. Furthermore, we undertake an in-depth analysis to offer insights into codec models from both application and signal perspectives, diverging from previous codec papers that concentrate mainly on signal-level comparisons. Finally, we will release the code, leaderboard, and data to accelerate progress within the community.
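
The abstract describes scoring codecs with signal-level metrics computed on pairs of original and codec-resynthesized audio, alongside downstream application results. As a rough illustration of the signal-level side only (a minimal sketch, not the Codec-SUPERB implementation, whose code the authors say will be released separately), the snippet below computes two generic waveform-level measures, SI-SNR and log-spectral distance, for a reference/resynthesis pair; the function names, window sizes, and the toy example are illustrative assumptions.

```python
import numpy as np

def si_snr(ref: np.ndarray, est: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-noise ratio (dB) between a reference
    waveform and a codec-resynthesized waveform of the same length."""
    ref = ref - ref.mean()
    est = est - est.mean()
    # Project the estimate onto the reference to remove gain differences.
    proj = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    noise = est - proj
    return 10.0 * np.log10((np.sum(proj ** 2) + eps) / (np.sum(noise ** 2) + eps))

def log_spectral_distance(ref: np.ndarray, est: np.ndarray,
                          n_fft: int = 1024, hop: int = 256) -> float:
    """Mean log-spectral distance (dB) over aligned STFT magnitude frames."""
    def stft_mag(x: np.ndarray) -> np.ndarray:
        window = np.hanning(n_fft)
        frames = [np.abs(np.fft.rfft(x[s:s + n_fft] * window))
                  for s in range(0, len(x) - n_fft + 1, hop)]
        return np.stack(frames)  # shape: (num_frames, n_fft // 2 + 1)
    R, E = stft_mag(ref), stft_mag(est)
    n = min(len(R), len(E))  # truncate to the shorter signal's frame count
    log_r = 20.0 * np.log10(R[:n] + 1e-8)
    log_e = 20.0 * np.log10(E[:n] + 1e-8)
    return float(np.mean(np.sqrt(np.mean((log_r - log_e) ** 2, axis=1))))

if __name__ == "__main__":
    # Toy check with a synthetic "codec" that just adds small noise.
    rng = np.random.default_rng(0)
    ref = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
    est = ref + 0.01 * rng.standard_normal(ref.shape)
    print(f"SI-SNR: {si_snr(ref, est):.1f} dB")
    print(f"LSD:    {log_spectral_distance(ref, est):.2f} dB")
```

A benchmark run in this spirit would apply such metrics over an evaluation corpus, aggregate per codec, and report them next to application-level scores; an actual suite would also include perceptual and intelligibility measures such as PESQ and STOI rather than only the two toy metrics shown here.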
