Pheme: Efficient and Conversational Speech Generation (2401.02839v1)

Published 5 Jan 2024 in eess.AS, cs.AI, and cs.CL

Abstract: In recent years, speech generation has seen remarkable progress, now achieving one-shot generation capability that is often virtually indistinguishable from real human voice. Integrating such advancements in speech generation with LLMs might revolutionize a wide range of applications. However, certain applications, such as assistive conversational systems, require natural and conversational speech generation tools that also operate efficiently in real time. Current state-of-the-art models like VALL-E and SoundStorm, powered by hierarchical neural audio codecs, require large neural components and extensive training data to work well. In contrast, MQTTS aims to build more compact conversational TTS models while capitalizing on smaller-scale real-life conversational speech data. However, its autoregressive nature yields high inference latency and thus limits its real-time usage. In order to mitigate the current limitations of the state-of-the-art TTS models while capitalizing on their strengths, in this work we introduce the Pheme model series that 1) offers compact yet high-performing models, 2) allows for parallel speech generation of 3) natural conversational speech, and 4) can be trained efficiently on smaller-scale conversational data, cutting data demands by more than 10x but still matching the quality of autoregressive TTS models. We also show that through simple teacher-student distillation we can achieve significant improvements in voice quality for single-speaker setups on top of pretrained Pheme checkpoints, relying solely on synthetic speech generated by much larger teacher models. Audio samples and pretrained models are available online.

Efficient and Conversational Speech Generation

This paper addresses the challenges of developing efficient and conversational Text-to-Speech (TTS) systems, proposing the Pheme model series, aimed at synthesizing natural, human-like speech in real time with significant improvements in data efficiency and inference speed. Current state-of-the-art TTS models such as VALL-E and SoundStorm require substantial neural architectures and large-scale datasets, making them impractical for real-time applications like assistive conversational systems. Pheme provides a compact yet high-performing alternative, achieving comparable audio quality with more than ten times less training data.

Key Contributions

  1. Compact and High-Performance Models: The proposed approach enhances the efficiency of TTS models by employing a streamlined architecture that maintains high performance. This is achieved by leveraging smaller datasets of conversational speech, reducing the data demand significantly.
  2. Parallel Speech Generation: The new model utilizes non-autoregressive parallel decoding, inspired by MaskGIT-style inference, to speed up generation without compromising audio quality. This parallel approach contrasts with the autoregressive decoding of previous models, dramatically reducing latency (a minimal decoding sketch follows this list).
  3. Teacher-Student Distillation for Voice Quality Improvement: A teacher-student distillation setup with much larger teacher models improves naturalness and voice quality even with small amounts of data. This data-efficient method allows the model to be specialized to single-speaker scenarios using synthetic speech generated by third-party providers (see the distillation sketch below).
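
To make the parallel decoding in point 2 concrete, below is a minimal, self-contained sketch of MaskGIT-style iterative decoding over discrete acoustic tokens. It is an illustration under stated assumptions, not the paper's implementation: the vocabulary size, sequence length, step count, and the stand-in `model_logits` are all hypothetical, and the real predictor conditions on text and prompt audio.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 1024   # hypothetical acoustic-codebook size
MASK = VOCAB   # reserved id marking a still-masked position
SEQ_LEN = 64   # hypothetical number of codec frames to generate
STEPS = 8      # decoding iterations (vs. SEQ_LEN steps autoregressively)

def model_logits(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for the trained masked-token predictor; a real model
    would condition on the input text and on prompt audio."""
    return rng.normal(size=(tokens.shape[0], VOCAB))

def maskgit_decode(seq_len: int = SEQ_LEN, steps: int = STEPS) -> np.ndarray:
    tokens = np.full(seq_len, MASK)
    for t in range(steps):
        logits = model_logits(tokens)
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        # Sample a candidate token and record its confidence at every position.
        sampled = np.array([rng.choice(VOCAB, p=p) for p in probs])
        conf = probs[np.arange(seq_len), sampled]
        fixed = tokens != MASK
        sampled[fixed] = tokens[fixed]  # never resample already-committed tokens
        conf[fixed] = np.inf            # committed tokens always stay committed
        # Cosine schedule: how many positions remain masked after this step.
        still_masked = int(seq_len * math.cos(math.pi / 2 * (t + 1) / steps))
        # Commit the most confident predictions; the rest stay masked.
        order = np.argsort(-conf)
        commit = order[: seq_len - still_masked]
        tokens[commit] = sampled[commit]
    return tokens

print(maskgit_decode())  # 64 acoustic token ids produced in 8 parallel steps
```

The efficiency property is visible in the loop structure: the whole sequence is produced in a fixed, small number of model calls, instead of one call per token as in autoregressive decoding.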
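
Point 3 can likewise be sketched as a data pipeline. All interfaces below are hypothetical; the approach amounts to fine-tuning a pretrained Pheme checkpoint on synthetic single-speaker speech produced by a much larger teacher TTS.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Utterance:
    text: str
    audio: bytes  # waveform returned by the teacher model

def build_distillation_set(
    teacher_synthesize: Callable[[str, str], bytes],  # hypothetical teacher API
    texts: List[str],
    speaker_id: str,
) -> List[Utterance]:
    """Generate single-speaker (text, synthetic audio) training pairs from a
    large teacher TTS; the student is then fine-tuned on these pairs starting
    from a pretrained multi-speaker checkpoint."""
    return [Utterance(t, teacher_synthesize(t, speaker_id)) for t in texts]
```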

Evaluation and Results

The model is evaluated on various aspects of TTS, including richness and naturalness of prosody, intelligibility of speech, and inference efficiency:

  • Word Error Rate (WER): The proposed model achieves a WER of 12.4%, improving intelligibility over comparable models such as MQTTS (14.2%).
  • Speaker Similarity Score (SSS): Despite a lower SSS than MQTTS (0.594 vs. 0.682), the model remains competitive, preserving essential speaker characteristics.
  • Mel-Cepstral Distortion (MCD): The model achieves an MCD of 8.838 (lower is better), indicating strong reconstruction fidelity; the standard formula is shown after this list.
  • Fréchet Inception Distance (FID): An FID of 20.349 (lower is better) indicates strong generation diversity and naturalness.
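
For reference, MCD (Kubichek, reference 21) between the d-th mel-cepstral coefficients c_d of a reference frame and ĉ_d of a synthesized frame is commonly computed as

```latex
\mathrm{MCD} = \frac{10}{\ln 10}\,\sqrt{2 \sum_{d=1}^{D} \left( c_d - \hat{c}_d \right)^2}
```

with the per-frame values averaged after the two sequences are aligned, typically via dynamic time warping (reference 28).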

In terms of efficiency, the model achieves a Real-Time Factor (RTF) of 0.133 for sentence processing, substantially cutting the processing time required by prior solutions such as MQTTS; the helper below makes the definition concrete.
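
RTF is conventionally defined as wall-clock synthesis time divided by the duration of the generated audio, so values below 1 mean faster-than-real-time synthesis. A small illustrative helper (the 10-second utterance below is a hypothetical example, not a figure from the paper):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of the generated audio.
    RTF < 1.0 means the system produces speech faster than real time."""
    return synthesis_seconds / audio_seconds

# At the reported RTF of 0.133, a hypothetical 10-second utterance would take
# roughly 10 * 0.133 ≈ 1.33 seconds to synthesize on the same hardware.
print(real_time_factor(1.33, 10.0))  # 0.133
```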

Implications and Future Directions

The implications of this work are significant across both academic and industrial domains. In practice, the model can be applied to build real-time conversational systems such as voice assistants, where user satisfaction depends heavily on the naturalness and immediacy of interactions. The theoretical contribution lies in the evidence that robust, high-fidelity TTS systems can be built with comparatively minimalist designs.

Future exploration could focus on refining the text-to-semantic (T2S) component to further reduce computational load, or on applying newer, more efficient parallel decoding strategies. Incorporating multilingual capabilities or synthesizing longer utterances could address broader application needs. Moreover, expanding publicly available high-quality conversational datasets would bolster research in this area, enabling further innovation in TTS technology.

By providing a high-performing, data-efficient TTS framework, this research sets a new precedent for future developments in speech synthesis, challenging the paradigm of resource-intensive model design.

References (49)
  1. MusicLM: Generating music from text. CoRR, abs/2301.11325, 2023. URL https://doi.org/10.48550/arXiv.2301.11325.
  2. AudioLM: A language modeling approach to audio generation. IEEE ACM Transactions on Audio, Speech and Language Processing, 31:2523–2533, 2023a. URL https://doi.org/10.1109/TASLP.2023.3288409.
  3. SoundStorm: Efficient parallel audio generation. CoRR, abs/2305.09636, 2023b. doi: 10.48550/ARXIV.2305.09636. URL https://doi.org/10.48550/arXiv.2305.09636.
  4. MaskGIT: Masked generative image transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pp.  11305–11315. IEEE, 2022. URL https://doi.org/10.1109/CVPR52688.2022.01103.
  5. GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. In Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021, pp.  3670–3674. ISCA, 2021. URL https://doi.org/10.21437/Interspeech.2021-1965.
  6. A vector quantized approach for text to speech synthesis on real-world spontaneous speech. In Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7-14, 2023, pp.  12644–12652. AAAI Press, 2023a. URL https://doi.org/10.1609/aaai.v37i11.26488.
  7. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022. URL https://doi.org/10.1109/JSTSP.2022.3188113.
  8. BEATs: Audio pre-training with acoustic tokenizers. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp.  5178–5193. PMLR, 2023b. URL https://proceedings.mlr.press/v202/chen23ag.html.
  9. High fidelity neural audio compression. CoRR, abs/2210.13438, 2022. doi: 10.48550/ARXIV.2210.13438. URL https://doi.org/10.48550/arXiv.2210.13438.
  10. VampNet: Music generation via masked acoustic token modeling. CoRR, abs/2307.04686, 2023. URL https://doi.org/10.48550/arXiv.2307.04686.
  11. Conformer: Convolution-augmented transformer for speech recognition. In Helen Meng, Bo Xu, and Thomas Fang Zheng (eds.), Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020, pp.  5036–5040. ISCA, 2020. URL https://doi.org/10.21437/Interspeech.2020-3015.
  12. Espnet-TTS: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8, 2020, pp.  7654–7658. IEEE, 2020. URL https://doi.org/10.1109/ICASSP40776.2020.9053512.
  13. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp.  6626–6637, 2017.
  14. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE ACM Transactions on Audio, Speech and Language Processing, 29:3451–3460, 2021. URL https://doi.org/10.1109/TASLP.2021.3122291.
  15. Ensemble knowledge distillation of self-supervised speech models. In IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP 2023, Rhodes Island, Greece, June 4-10, 2023, pp.  1–5. IEEE, 2023a. URL https://doi.org/10.1109/ICASSP49357.2023.10096445.
  16. AudioGPT: Understanding and generating speech, music, sound, and talking head. CoRR, abs/2304.12995, 2023b. URL https://doi.org/10.48550/arXiv.2304.12995.
  17. Libri-Light: A benchmark for ASR with limited or no supervision. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8, 2020, pp.  7669–7673. IEEE, 2020. URL https://doi.org/10.1109/ICASSP40776.2020.9052942.
  18. Speak, read and prompt: High-fidelity text-to-speech with minimal supervision. CoRR, abs/2302.03540, 2023. URL https://doi.org/10.48550/arXiv.2302.03540.
  19. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139, pp.  5530–5540. PMLR, 2021. URL http://proceedings.mlr.press/v139/kim21f.html.
  20. AudioGen: Textually guided audio generation. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=CYK7RfcOzQ4.
  21. R. Kubichek. Mel-cepstral distance measure for objective speech quality assessment. In Proceedings of IEEE Pacific Rim Conference on Communications Computers and Signal Processing, volume 1, pp.  125–128 vol.1, 1993. doi: 10.1109/PACRIM.1993.407206.
  22. High-fidelity audio compression with improved RVQGAN. CoRR, abs/2306.06546, 2023. URL https://doi.org/10.48550/arXiv.2306.06546.
  23. Sparks of large audio models: A survey and outlook. CoRR, abs/2308.12792, 2023. URL https://doi.org/10.48550/arXiv.2308.12792.
  24. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp.  19274–19286. PMLR, 2023. URL https://proceedings.mlr.press/v202/leviathan23a.html.
  25. Styletts-vc: One-shot voice conversion by knowledge transfer from style-based TTS models. In IEEE Spoken Language Technology Workshop, SLT 2022, Doha, Qatar, January 9-12, 2023, pp.  920–927. IEEE, 2022. URL https://doi.org/10.1109/SLT54892.2023.10022498.
  26. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
  27. EfficientTTS: An efficient and high-quality text-to-speech architecture. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp.  7700–7709. PMLR, 2021. URL http://proceedings.mlr.press/v139/miao21a.html.
  28. Meinard Müller. Dynamic time warping. Information Retrieval for Music and Motion, pp.  69–84, 2007.
  29. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, pp.  5206–5210. IEEE, 2015. URL https://doi.org/10.1109/ICASSP.2015.7178964.
  30. DPHuBERT: Joint distillation and pruning of self-supervised speech models. CoRR, abs/2305.17651, 2023. URL https://doi.org/10.48550/arXiv.2305.17651.
  31. Modular deep learning. CoRR, abs/2302.11529, 2023. URL https://doi.org/10.48550/arXiv.2302.11529.
  32. Powerset multi-class cross entropy loss for neural speaker diarization. CoRR, abs/2310.13025, 2023. URL https://doi.org/10.48550/arXiv.2310.13025.
  33. MLS: A large-scale multilingual dataset for speech research. In Helen Meng, Bo Xu, and Thomas Fang Zheng (eds.), Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020, pp.  2757–2761. ISCA, 2020. URL https://doi.org/10.21437/Interspeech.2020-2826.
  34. Scaling speech technology to 1,000+ languages. CoRR, abs/2305.13516, 2023. URL https://doi.org/10.48550/arXiv.2305.13516.
  35. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:140:1–140:67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
  36. NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers. CoRR, abs/2304.09116, 2023. URL https://doi.org/10.48550/arXiv.2304.09116.
  37. RoFormer: Enhanced transformer with rotary position embedding. CoRR, abs/2104.09864, 2021. URL https://arxiv.org/abs/2104.09864.
  38. SUPERB-SG: enhanced speech processing universal performance benchmark for semantic and generative capabilities. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pp.  8479–8492. Association for Computational Linguistics, 2022. URL https://doi.org/10.18653/v1/2022.acl-long.580.
  39. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp.  5998–6008, 2017.
  40. Neural codec language models are zero-shot text to speech synthesizers. CoRR, abs/2301.02111, 2023. URL https://doi.org/10.48550/arXiv.2301.02111.
  41. AdaSpeech 4: Adaptive text to speech in zero-shot scenarios. In Hanseok Ko and John H. L. Hansen (eds.), Interspeech 2022, 23rd Annual Conference of the International Speech Communication Association, Incheon, Korea, 18-22 September 2022, pp.  2568–2572. ISCA, 2022. URL https://doi.org/10.21437/Interspeech.2022-901.
  42. M²-CTTS: End-to-end multi-scale multi-modal conversational text-to-speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP 2023, Rhodes Island, Greece, June 4-10, 2023, pp.  1–5. IEEE, 2023. URL https://doi.org/10.1109/ICASSP49357.2023.10096905.
  43. ByT5: Towards a token-free future with pre-trained byte-to-byte models. Transactions of the Association for Computational Linguistics, 10:291–306, 2022. doi: 10.1162/tacl_a_00461. URL https://aclanthology.org/2022.tacl-1.17.
  44. InstructTTS: Modelling expressive TTS in discrete latent space with natural language style prompt. CoRR, abs/2301.13662, 2023. URL https://doi.org/10.48550/arXiv.2301.13662.
  45. SoundStream: An end-to-end neural audio codec. IEEE ACM Transactions on Audio, Speech and Language Processing, 30:495–507, 2022. URL https://doi.org/10.1109/TASLP.2021.3129994.
  46. LibriTTS: A corpus derived from librispeech for text-to-speech. In Gernot Kubin and Zdravko Kacic (eds.), Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019, pp.  1526–1530. ISCA, 2019. URL https://doi.org/10.21437/Interspeech.2019-2441.
  47. DenoiSpeech: Denoising text to speech with frame-level noise modeling. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2021, Toronto, ON, Canada, June 6-11, 2021, pp.  7063–7067. IEEE, 2021. URL https://doi.org/10.1109/ICASSP39728.2021.9413934.
  48. SpeechTokenizer: Unified speech tokenizer for speech large language models. CoRR, abs/2308.16692, 2023. URL https://doi.org/10.48550/arXiv.2308.16692.
  49. Vec-Tok Speech: Speech vectorization and tokenization for neural speech generation. CoRR, abs/2310.07246, 2023. URL https://doi.org/10.48550/arXiv.2310.07246.
Authors (4)
  1. Paweł Budzianowski (27 papers)
  2. Taras Sereda (2 papers)
  3. Tomasz Cichy (2 papers)
  4. Ivan Vulić (130 papers)
Citations (7)