
Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model (2405.09768v1)

Published 16 May 2024 in eess.AS and cs.SD

Abstract: Recent advances in generative language modeling applied to discrete speech tokens presented a new avenue for text-to-speech (TTS) synthesis. These speech language models (SLMs), similarly to their textual counterparts, are scalable, probabilistic, and context-aware. While they can produce diverse and natural outputs, they sometimes face issues such as unintelligibility and the inclusion of non-speech noises or hallucination. As the adoption of this innovative paradigm in speech synthesis increases, there is a clear need for an in-depth evaluation of its capabilities and limitations. In this paper, we evaluate TTS from a discrete token-based SLM, through both automatic metrics and listening tests. We examine five key dimensions: speaking style, intelligibility, speaker consistency, prosodic variation, and spontaneous behaviour. Our results highlight the model's strength in generating varied prosody and spontaneous outputs. It is also rated higher in naturalness and context appropriateness in listening tests compared to a conventional TTS. However, the model's performance in intelligibility and speaker consistency lags behind traditional TTS. Additionally, we show that increasing the scale of SLMs offers a modest boost in robustness. Our findings aim to serve as a benchmark for future advancements in generative SLMs for speech synthesis.

Evaluating the Bark Speech Language Model for Text-to-Speech

Introduction

Text-to-speech (TTS) technology has advanced rapidly, with new and increasingly sophisticated models appearing continually. One such model is Bark, a generative speech language model (SLM) that promises to push the boundaries of TTS. Bark operates without the need for text transcription, working through next-token prediction on discrete speech tokens, which makes it quite different from traditional TTS systems.

Bark: An Overview

Why Bark? There are a few compelling reasons the researchers chose to analyze Bark. Its open-source nature makes both the code and model weights accessible. Furthermore, its attributes mirror those of many state-of-the-art TTS models, making any findings potentially applicable to other similar models. Lastly, Bark's training on a mixed-style dataset (as opposed to purely read or conversational data) provides an excellent opportunity to evaluate its versatility across different speaking styles.

Model Architecture

Bark's architecture comprises three levels of discrete-token models that operate sequentially:

  1. Text-to-Semantic: A transformer model that processes text tokens to produce a sequence of semantic tokens.
  2. Semantic-to-Coarse: Another transformer model that converts the semantic tokens into the first two codebooks of audio tokens.
  3. Coarse-to-Fine: An encoder-only transformer that generates the remaining audio tokens needed to reconstruct the speech.

This architecture allows Bark to handle various speech synthesis tasks by conditioning on token prompts from a given speaker, a setup often referred to as "zero-shot" synthesis.
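For illustration, below is a minimal synthesis sketch using the open-source bark package from the Suno AI repository; the speaker-prompt name and output path follow the repository's published examples, and exact APIs may differ across versions.

```python
# Minimal zero-shot-style synthesis with the open-source Bark package.
# "v2/en_speaker_6" is one of the speaker prompts shipped with the repository;
# paths and prompt names here are illustrative.
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()  # download and cache the three token models

text = "Hello, this is a test of token-based speech synthesis."
audio = generate_audio(text, history_prompt="v2/en_speaker_6")  # condition on a speaker prompt

write_wav("bark_output.wav", SAMPLE_RATE, audio)
```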

Evaluation Methodology

The paper evaluates Bark's performance across five dimensions: speaking style, intelligibility, speaker consistency, prosody variation, and spontaneous behavior. The evaluations used a set of 10 speaker prompts for synthesis, with multi-speaker VITS, a conventional TTS system, serving as the baseline.

Text Inputs: Two distinct types of text inputs were used:

  • Read Speech Text: Extracted from the LibriTTS corpus, representing text from audiobooks.
  • Conversational Text: Drawn from the DailyDialog corpus, representing dialogue-based text.

Key Evaluations

Intelligibility

To measure how clearly Bark articulates text, the team transcribed its output with the Whisper ASR model and calculated the word error rate (WER). The results indicate that Bark struggles with intelligibility, showing higher WER across all speakers compared to VITS. Larger Bark models performed slightly better on this metric, suggesting that increased scale improves clarity.
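A rough sketch of such an ASR-based intelligibility check, assuming the openai-whisper and jiwer packages; the model size, input text, and file path are illustrative rather than the paper's exact setup.

```python
# ASR-based intelligibility check: transcribe synthesised audio with Whisper
# and score it against the input text with word error rate (WER).
import whisper
from jiwer import wer

asr = whisper.load_model("base")  # model size is an assumption, not the paper's choice

input_text = "The quick brown fox jumps over the lazy dog."  # text given to the TTS system
transcript = asr.transcribe("bark_output.wav")["text"]       # placeholder audio path

# In practice, punctuation and casing should also be normalised before scoring.
print(f"WER: {wer(input_text.lower(), transcript.lower()):.2%}")
```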

Speaker Consistency

Speaker consistency was assessed using a speaker identification model (ECAPA-TDNN) to calculate speaker similarity scores. The results showed that Bark had a tendency to drift from the conditioned speaker, sometimes sounding like other voices within its training set. Interestingly, VITS demonstrated considerably higher consistency in maintaining speaker identity.
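As a concrete illustration, speaker similarity can be computed as the cosine similarity between ECAPA-TDNN embeddings of the speaker prompt and the synthesised utterance. The sketch below assumes SpeechBrain's pretrained model and placeholder file paths, not the paper's exact pipeline.

```python
# Speaker-consistency check with a pretrained ECAPA-TDNN speaker encoder.
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_ecapa",
)

def embed(path: str) -> torch.Tensor:
    """Return an ECAPA-TDNN speaker embedding for a mono wav file."""
    signal, sr = torchaudio.load(path)
    if sr != 16000:  # the pretrained model expects 16 kHz input
        signal = torchaudio.functional.resample(signal, sr, 16000)
    return classifier.encode_batch(signal).squeeze()

# Cosine similarity between the speaker prompt and a synthesised utterance;
# lower values suggest drift away from the conditioned speaker.
ref = embed("speaker_prompt.wav")  # placeholder path
hyp = embed("bark_output.wav")     # placeholder path
similarity = torch.nn.functional.cosine_similarity(ref, hyp, dim=0)
print(f"speaker similarity: {similarity.item():.3f}")
```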

Prosodic Variation

Using measures like fundamental frequency (f0) and speech rate, Bark was found to generate a more varied prosody compared to VITS. This capability to produce diverse prosodic expressions suggests that Bark can potentially deliver more natural and less monotonous speech.
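One way to approximate these measures, assuming librosa's YIN implementation; the thresholds, file path, and word count are illustrative placeholders.

```python
# Rough prosodic-variation measures: f0 spread via YIN and a simple speech rate.
import librosa
import numpy as np

y, sr = librosa.load("bark_output.wav", sr=16000)  # placeholder path

# Fundamental frequency via YIN; keep only frames in a plausible speech range,
# since unvoiced frames can yield spurious estimates.
f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)
voiced = f0[(f0 > 60) & (f0 < 400)]

f0_std = np.std(voiced)           # spread of f0 as a proxy for prosodic variation
duration_s = len(y) / sr
n_words = 9                        # placeholder word count, e.g. from an ASR transcript
words_per_second = n_words / duration_s

print(f"f0 std: {f0_std:.1f} Hz, speech rate: {words_per_second:.2f} words/s")
```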

Spontaneous Behavior

The research also highlighted Bark's ability to incorporate spontaneous elements such as fillers (e.g., "um", "uh") and variable pause durations in its synthesized speech. This spontaneous behavior is rarely observed in traditional models like VITS, which tend to produce more scripted and predictable outputs.
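Filler frequency can be estimated directly from ASR transcripts of the synthesised speech; below is a simple sketch with an illustrative filler inventory and placeholder transcripts.

```python
# Count spontaneous fillers in ASR transcripts of synthesised speech.
import re

FILLERS = {"um", "uh", "erm", "hmm"}  # illustrative filler inventory

transcripts = [  # placeholder Whisper-style transcripts
    "So, um, I was thinking we could, uh, meet tomorrow.",
    "That sounds great to me.",
]

def count_fillers(text: str) -> int:
    """Count filler tokens in a lowercased, word-tokenised transcript."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(tok in FILLERS for tok in tokens)

rate = sum(count_fillers(t) for t in transcripts) / len(transcripts)
print(f"average fillers per utterance: {rate:.2f}")
```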

Listening Tests

Human listeners evaluated the naturalness and contextual suitability of the synthesized speech. Bark generally performed better than VITS in these subjective tests, but its advantage was more pronounced for read speech than for conversational contexts. Interestingly, providing Bark with prior utterances as prompts did not significantly enhance its performance, raising questions about how well the model exploits context.

Discussion

While Bark shows promise with its natural prosody and spontaneous behavior, it falls short in robustness, particularly intelligibility and speaker consistency. This finding opens the door to strategies such as scaling to improve performance. The paper also underscores the variability among SLMs and highlights the need for broader evaluations of emerging models.

Conclusion

The paper provides a comprehensive evaluation of the Bark SLM for TTS, illustrating its strengths in generating varied and natural speech. However, challenges remain in achieving consistent intelligibility and speaker fidelity. These insights set a benchmark for future advancements and evaluations in the field of speech language models. For those interested in delving deeper or testing the methods themselves, the researchers have made their evaluation code and the evaluated audio samples publicly available.

References (28)
  1. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460.
  2. James Betker. 2023. Better speech synthesis through scaling. arXiv preprint arXiv:2305.07243.
  3. AudioLM: A language modeling approach to audio generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
  4. SoundStorm: Efficient parallel audio generation. arXiv preprint arXiv:2305.09636.
  5. A vector quantized approach for text to speech synthesis on real-world spontaneous speech. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 12644–12652.
  6. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518.
  7. Alain de Cheveigné and Hideki Kawahara. 2002. YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4):1917–1930.
  8. High fidelity neural audio compression. Transactions on Machine Learning Research.
  9. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. In Proc. Interspeech 2020. International Speech Communication Association.
  10. International Telecommunication Union, Telecommunication Standardization Sector. 1996. Methods for subjective determination of transmission quality. ITU Recommendation ITU-T P.800.
  11. Speak, read and prompt: High-fidelity text-to-speech with minimal supervision. Transactions of the Association for Computational Linguistics, 11:1703–1718.
  12. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In International Conference on Machine Learning, pages 5530–5540. PMLR.
  13. BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100k hours of data. arXiv preprint arXiv:2402.08093.
  14. On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9:1336–1354.
  15. FreeVC: Towards high-quality text-free one-shot voice conversion. In Proc. ICASSP, pages 1–5. IEEE.
  16. VoxtLM: Unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks. arXiv preprint arXiv:2309.07937.
  17. Embers of autoregression: Understanding large language models through the problem they are trained to solve. arXiv preprint arXiv:2309.13638.
  18. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492–28518. PMLR.
  19. AudioPaLM: A large language model that can speak and listen. arXiv preprint arXiv:2306.12925.
  20. Suno-AI. 2023. Bark: Text-prompted generative audio model. https://github.com/suno-ai/bark.
  21. Jason Taylor and Korin Richmond. 2021. Confidence intervals for ASR-based TTS evaluation. In Proc. Interspeech, pages 2791–2795. International Speech Communication Association.
  22. It's not what you said, it's how you said it: Discriminative perception of speech as a multichannel communication system. In Proc. Interspeech 2021. International Speech Communication Association.
  23. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111.
  24. VioLA: Unified codec language models for speech recognition, synthesis, and translation. arXiv preprint arXiv:2305.16107.
  25. SpeechX: Neural codec language model as a versatile speech transformer. arXiv preprint arXiv:2308.06873.
  26. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proc. of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995.
  27. CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit. University of Edinburgh, The Centre for Speech Technology Research (CSTR), 6:15.
  28. LibriTTS: A corpus derived from LibriSpeech for text-to-speech. In Proc. Interspeech.
Authors
  1. Siyang Wang
  2. Éva Székely