
Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model (2405.09768v1)

Published 16 May 2024 in eess.AS and cs.SD

Abstract: Recent advances in generative language modeling applied to discrete speech tokens presented a new avenue for text-to-speech (TTS) synthesis. These speech language models (SLMs), similarly to their textual counterparts, are scalable, probabilistic, and context-aware. While they can produce diverse and natural outputs, they sometimes face issues such as unintelligibility and the inclusion of non-speech noises or hallucination. As the adoption of this innovative paradigm in speech synthesis increases, there is a clear need for an in-depth evaluation of its capabilities and limitations. In this paper, we evaluate TTS from a discrete token-based SLM, through both automatic metrics and listening tests. We examine five key dimensions: speaking style, intelligibility, speaker consistency, prosodic variation, and spontaneous behaviour. Our results highlight the model's strength in generating varied prosody and spontaneous outputs. It is also rated higher in naturalness and context appropriateness in listening tests compared to a conventional TTS. However, the model's performance in intelligibility and speaker consistency lags behind traditional TTS. Additionally, we show that increasing the scale of SLMs offers a modest boost in robustness. Our findings aim to serve as a benchmark for future advancements in generative SLMs for speech synthesis.

Evaluating the Bark Speech Language Model for Text-to-Speech

Introduction

Text-to-speech (TTS) technology has advanced rapidly, with new and increasingly sophisticated models appearing continually. One such model is Bark, a generative speech language model (SLM) that promises to push the boundaries of TTS. Bark operates without the need for text transcription, working through next-token prediction on discrete speech tokens, which makes it quite different from traditional TTS systems.

Bark: An Overview

Why Bark? There are a few compelling reasons the researchers chose to analyze Bark. Its open-source nature makes both the code and model weights accessible. Furthermore, its attributes mirror those of many state-of-the-art TTS models, making any findings potentially applicable to other similar models. Lastly, Bark's training on a mixed-style dataset (as opposed to purely read or conversational data) provides an excellent opportunity to evaluate its versatility across different speaking styles.

Model Architecture

Bark's architecture comprises three levels of discrete-token models that operate sequentially:

  1. Text-to-Semantic: A transformer model that processes text tokens to produce a sequence of semantic tokens.
  2. Semantic-to-Coarse: Another transformer model that converts the semantic tokens into the first two codebooks of audio tokens.
  3. Coarse-to-Fine: An encoder-only transformer that generates the remaining audio tokens needed to reconstruct the speech.

This architecture allows Bark to handle various speech synthesis tasks by conditioning on token prompts from a given speaker, a setup often referred to as "zero-shot" synthesis.
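For illustration, below is a minimal synthesis sketch using the open-source bark package from the Suno AI repository; the speaker-prompt name and output path follow the repository's published examples, and exact APIs may differ across versions.

```python
# Minimal zero-shot-style synthesis with the open-source Bark package.
# "v2/en_speaker_6" is one of the speaker prompts shipped with the repository;
# paths and prompt names here are illustrative.
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()  # download and cache the three token models

text = "Hello, this is a test of token-based speech synthesis."
audio = generate_audio(text, history_prompt="v2/en_speaker_6")  # condition on a speaker prompt

write_wav("bark_output.wav", SAMPLE_RATE, audio)
```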

Evaluation Methodology

The paper evaluates Bark's performance across five dimensions: speaking style, intelligibility, speaker consistency, prosody variation, and spontaneous behavior. The evaluations used a set of 10 speaker prompts for synthesis, with multi-speaker VITS, a conventional TTS system, serving as the baseline.

Text Inputs: Two distinct types of text inputs were used:

  • Read Speech Text: Extracted from the LibriTTS corpus, representing text from audiobooks.
  • Conversational Text: Drawn from the DailyDialog corpus, representing dialogue-based text.

Key Evaluations

Intelligibility

To measure how clearly Bark articulates text, the team transcribed its output with the Whisper ASR model and calculated the word error rate (WER). The results indicate that Bark struggles with intelligibility, showing higher WER across all speakers compared to VITS. Larger Bark models performed slightly better on this metric, suggesting that increased scale improves clarity.
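A rough sketch of such an ASR-based intelligibility check, assuming the openai-whisper and jiwer packages; the model size, input text, and file path are illustrative rather than the paper's exact setup.

```python
# ASR-based intelligibility check: transcribe synthesised audio with Whisper
# and score it against the input text with word error rate (WER).
import whisper
from jiwer import wer

asr = whisper.load_model("base")  # model size is an assumption, not the paper's choice

input_text = "The quick brown fox jumps over the lazy dog."  # text given to the TTS system
transcript = asr.transcribe("bark_output.wav")["text"]       # placeholder audio path

# In practice, punctuation and casing should also be normalised before scoring.
print(f"WER: {wer(input_text.lower(), transcript.lower()):.2%}")
```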

Speaker Consistency

Speaker consistency was assessed using a speaker identification model (ECAPA-TDNN) to calculate speaker similarity scores. The results showed that Bark had a tendency to drift from the conditioned speaker, sometimes sounding like other voices within its training set. Interestingly, VITS demonstrated considerably higher consistency in maintaining speaker identity.
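As a concrete illustration, speaker similarity can be computed as the cosine similarity between ECAPA-TDNN embeddings of the speaker prompt and the synthesised utterance. The sketch below assumes SpeechBrain's pretrained model and placeholder file paths, not the paper's exact pipeline.

```python
# Speaker-consistency check with a pretrained ECAPA-TDNN speaker encoder.
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_ecapa",
)

def embed(path: str) -> torch.Tensor:
    """Return an ECAPA-TDNN speaker embedding for a mono wav file."""
    signal, sr = torchaudio.load(path)
    if sr != 16000:  # the pretrained model expects 16 kHz input
        signal = torchaudio.functional.resample(signal, sr, 16000)
    return classifier.encode_batch(signal).squeeze()

# Cosine similarity between the speaker prompt and a synthesised utterance;
# lower values suggest drift away from the conditioned speaker.
ref = embed("speaker_prompt.wav")  # placeholder path
hyp = embed("bark_output.wav")     # placeholder path
similarity = torch.nn.functional.cosine_similarity(ref, hyp, dim=0)
print(f"speaker similarity: {similarity.item():.3f}")
```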

Prosodic Variation

Using measures like fundamental frequency (f0) and speech rate, Bark was found to generate a more varied prosody compared to VITS. This capability to produce diverse prosodic expressions suggests that Bark can potentially deliver more natural and less monotonous speech.
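One way to approximate these measures, assuming librosa's YIN implementation; the thresholds, file path, and word count are illustrative placeholders.

```python
# Rough prosodic-variation measures: f0 spread via YIN and a simple speech rate.
import librosa
import numpy as np

y, sr = librosa.load("bark_output.wav", sr=16000)  # placeholder path

# Fundamental frequency via YIN; keep only frames in a plausible speech range,
# since unvoiced frames can yield spurious estimates.
f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)
voiced = f0[(f0 > 60) & (f0 < 400)]

f0_std = np.std(voiced)           # spread of f0 as a proxy for prosodic variation
duration_s = len(y) / sr
n_words = 9                        # placeholder word count, e.g. from an ASR transcript
words_per_second = n_words / duration_s

print(f"f0 std: {f0_std:.1f} Hz, speech rate: {words_per_second:.2f} words/s")
```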

Spontaneous Behavior

The research also highlighted Bark's ability to incorporate spontaneous elements such as fillers (e.g., "um", "uh") and variable pause durations in its synthesized speech. This spontaneous behavior is rarely observed in traditional models like VITS, which tend to produce more scripted and predictable outputs.
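Filler frequency can be estimated directly from ASR transcripts of the synthesised speech; below is a simple sketch with an illustrative filler inventory and placeholder transcripts.

```python
# Count spontaneous fillers in ASR transcripts of synthesised speech.
import re

FILLERS = {"um", "uh", "erm", "hmm"}  # illustrative filler inventory

transcripts = [  # placeholder Whisper-style transcripts
    "So, um, I was thinking we could, uh, meet tomorrow.",
    "That sounds great to me.",
]

def count_fillers(text: str) -> int:
    """Count filler tokens in a lowercased, word-tokenised transcript."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(tok in FILLERS for tok in tokens)

rate = sum(count_fillers(t) for t in transcripts) / len(transcripts)
print(f"average fillers per utterance: {rate:.2f}")
```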

Listening Tests

Human listeners evaluated the naturalness and contextual suitability of the synthesized speech. Bark generally performed better than VITS in these subjective tests, but its advantage was more pronounced for read speech than for conversational contexts. Interestingly, providing Bark with prior utterances as prompts did not significantly enhance its performance, raising questions about how well the model exploits context.

Discussion

While Bark shows promise with its natural prosody and spontaneous behavior, it falls short in robustness, particularly intelligibility and speaker consistency. This finding opens the door to strategies such as scaling to improve performance. The paper also underscores the variability among SLMs and highlights the need for broader evaluations of emerging models.

Conclusion

The paper provides a comprehensive evaluation of the Bark SLM for TTS, illustrating its strengths in generating varied and natural speech. However, challenges remain in achieving consistent intelligibility and speaker fidelity. These insights set a benchmark for future advancements and evaluations in the field of speech language models. For those interested in delving deeper or testing the methods themselves, the researchers have made their evaluation code and the evaluated audio samples publicly available.

References (28)
  1. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460.
  2. James Betker. 2023. Better speech synthesis through scaling. arXiv preprint arXiv:2305.07243.
  3. AudioLM: A language modeling approach to audio generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
  4. SoundStorm: Efficient parallel audio generation. arXiv preprint arXiv:2305.09636.
  5. A vector quantized approach for text to speech synthesis on real-world spontaneous speech. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 12644–12652.
  6. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518.
  7. Alain de Cheveigné and Hideki Kawahara. 2002. YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4):1917–1930.
  8. High fidelity neural audio compression. Transactions on Machine Learning Research.
  9. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. In Proc. Interspeech 2020. International Speech Communication Association.
  10. International Telecommunication Union, Telecommunication Standardization Sector. 1996. Methods for subjective determination of transmission quality. ITU Recommendation ITU-T P.800.
  11. Speak, read and prompt: High-fidelity text-to-speech with minimal supervision. Transactions of the Association for Computational Linguistics, 11:1703–1718.
  12. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In International Conference on Machine Learning, pages 5530–5540. PMLR.
  13. BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100k hours of data. arXiv preprint arXiv:2402.08093.
  14. On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9:1336–1354.
  15. FreeVC: Towards high-quality text-free one-shot voice conversion. In Proc. ICASSP, pages 1–5. IEEE.
  16. VoxtLM: Unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks. arXiv preprint arXiv:2309.07937.
  17. Embers of autoregression: Understanding large language models through the problem they are trained to solve. arXiv preprint arXiv:2309.13638.
  18. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492–28518. PMLR.
  19. AudioPaLM: A large language model that can speak and listen. arXiv preprint arXiv:2306.12925.
  20. Suno-AI. 2023. Bark: Text-prompted generative audio model. https://github.com/suno-ai/bark.
  21. Jason Taylor and Korin Richmond. 2021. Confidence intervals for ASR-based TTS evaluation. In Proc. Interspeech, pages 2791–2795. International Speech Communication Association.
  22. It's not what you said, it's how you said it: Discriminative perception of speech as a multichannel communication system. In Proc. Interspeech 2021. International Speech Communication Association.
  23. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111.
  24. VioLA: Unified codec language models for speech recognition, synthesis, and translation. arXiv preprint arXiv:2305.16107.
  25. SpeechX: Neural codec language model as a versatile speech transformer. arXiv preprint arXiv:2308.06873.
  26. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proc. of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995.
  27. CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit. University of Edinburgh, The Centre for Speech Technology Research (CSTR), 6:15.
  28. LibriTTS: A corpus derived from LibriSpeech for text-to-speech. In Proc. Interspeech.
Authors
  1. Siyang Wang
  2. Éva Székely