
DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models (2410.24177v1)

Published 31 Oct 2024 in eess.AS, cs.CL, cs.LG, and cs.SD

Abstract: Spoken language models (SLMs) have gained increasing attention with advancements in text-based, decoder-only LLMs. SLMs process both text and speech, enabling simultaneous speech understanding and generation. This paper presents Double-Codebook Speaker-invariant Clustering (DC-Spin), which aims to improve speech tokenization by bridging audio signals and SLM tokens. DC-Spin extracts speaker-invariant tokens that are rich in phonetic information and resilient to input variations, enhancing zero-shot SLM tasks and speech resynthesis. We propose a chunk-wise approach that makes DC-Spin streamable without retraining or performance degradation. Comparisons of tokenization methods (self-supervised and neural audio codecs), model scalability, and downstream task proxies show that tokens easily modeled by an n-gram LM or well aligned with phonemes offer strong performance, providing insights for designing speech tokenizers for SLMs.
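The chunk-wise streaming idea in the abstract can be illustrated with a minimal sketch. All names here are illustrative assumptions, not the paper's actual API: a stand-in encoder produces frame-level features, a fixed codebook assigns each frame to its nearest cluster (k-means style, as in speaker-invariant clustering), and chunks carry a short left context so boundary frames still see past samples, while tokens are emitted only for the new region.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_frames(audio_chunk):
    # Stand-in for a frozen speech encoder: one 8-dim feature
    # per 20 ms frame at 16 kHz (320 samples per frame).
    n_frames = len(audio_chunk) // 320
    return rng.standard_normal((n_frames, 8))

# Hypothetical codebook of 64 speaker-invariant "phonetic" codes.
codebook = rng.standard_normal((64, 8))

def tokenize(features):
    # Assign each frame to its nearest codebook entry.
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def stream_tokenize(audio, chunk_samples=16000, context_samples=3200):
    # Process audio chunk by chunk, prepending a short left context so
    # frames near chunk boundaries have past samples available; emit
    # tokens only for the non-context region, so the total token count
    # matches offline tokenization of the same audio.
    tokens = []
    for start in range(0, len(audio), chunk_samples):
        left = max(0, start - context_samples)
        feats = encode_frames(audio[left:start + chunk_samples])
        n_context_frames = (start - left) // 320
        tokens.extend(tokenize(feats[n_context_frames:]))
    return tokens

audio = rng.standard_normal(48000)  # 3 s of fake 16 kHz audio
print(len(stream_tokenize(audio)))  # 150 frames -> 150 tokens
```

The context window trades latency for boundary quality: a longer left context gives boundary frames more history at the cost of redundant encoder computation per chunk.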

Authors (5)
  1. Heng-Jui Chang
  2. Hongyu Gong
  3. Changhan Wang
  4. James Glass
  5. Yu-An Chung