
CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech (2404.02781v1)

Published 3 Apr 2024 in eess.AS and cs.SD

Abstract: With the emergence of neural audio codecs, which encode multiple streams of discrete tokens from audio, LLMs have recently gained attention as a promising approach for zero-shot Text-to-Speech (TTS) synthesis. Despite the ongoing rush towards scaling paradigms, audio tokenization ironically amplifies the scalability challenge, stemming from its long sequence length and the complexity of modeling the multiple sequences. To mitigate these issues, we present CLaM-TTS, which employs a probabilistic residual vector quantization to (1) achieve superior compression in the token length, and (2) allow an LLM to generate multiple tokens at once, thereby eliminating the need for cascaded modeling to handle the number of token streams. Our experimental results demonstrate that CLaM-TTS is better than or comparable to state-of-the-art neural codec-based TTS models regarding naturalness, intelligibility, speaker similarity, and inference speed. In addition, we examine the impact of the pretraining extent of the LLMs and their text tokenization strategies on performance.
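The compression idea the abstract builds on can be illustrated with plain (non-probabilistic) residual vector quantization: each stage quantizes the residual left by the previous stage, so one frame is described by a short tuple of codebook indices rather than a single huge codebook entry. The sketch below is a minimal, hypothetical numpy illustration with randomly initialized codebooks; it is not the paper's probabilistic RVQ, and all names and sizes here are illustrative assumptions.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Plain residual VQ: each stage quantizes the residual of the previous one."""
    residual = x.copy()
    codes = []
    for cb in codebooks:
        # pick the codeword nearest to the current residual
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruction is the sum of the selected codewords across stages."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

rng = np.random.default_rng(0)
# 4 stages of 256 entries each: a frame becomes 4 indices (4 bytes)
codebooks = [rng.standard_normal((256, 8)) for _ in range(4)]
x = rng.standard_normal(8)
codes = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
```

A cascaded codec LM would predict these four index streams one after another; the point of CLaM-TTS's probabilistic variant is to let the language model emit all of a frame's indices at once.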

Authors (4)
  1. Jaehyeon Kim
  2. Keon Lee
  3. Seungjun Chung
  4. Jaewoong Cho
Citations (25)