Evaluating the Bark Speech LLM for Text-to-Speech
Introduction
Text-to-speech (TTS) technology has advanced rapidly, with new and increasingly sophisticated models appearing continuously. One such model is Bark, a generative speech language model (SLM) that promises to push the boundaries of TTS. Bark is distinctive in that it does not depend on text transcription the way traditional TTS systems do; it operates through next-token prediction on discrete speech tokens.
Bark: An Overview
Why Bark? The researchers had a few compelling reasons to analyze it. Its open-source nature makes both the code and the model weights accessible. Its attributes mirror those of many state-of-the-art TTS models, so findings about Bark are potentially applicable to similar systems. Finally, Bark's training on a mixed-style dataset (as opposed to purely read or conversational data) provides an excellent opportunity to evaluate its versatility across speaking styles.
Model Architecture
Bark's architecture comprises three levels of discrete-token models that operate sequentially:
- Text-to-Semantic: A transformer model that processes text tokens to produce a sequence of semantic tokens.
- Semantic-to-Coarse: Another transformer model that converts the semantic tokens into the first two codebooks of audio tokens.
- Coarse-to-Fine: An encoder-only transformer that generates the remaining audio tokens needed to reconstruct the speech.
This architecture allows Bark to handle various speech synthesis tasks by conditioning on token prompts extracted from a given speaker's audio, an approach often referred to as "zero-shot" synthesis.
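To make this concrete, here is a minimal sketch of zero-shot synthesis with the open-source suno-ai/bark package. The speaker preset "v2/en_speaker_6" is one of the prompts bundled with the library, not necessarily one of the prompts used in the paper.

```python
# Minimal zero-shot synthesis sketch using the suno-ai/bark package
# (pip install git+https://github.com/suno-ai/bark.git).
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

# Downloads and caches the three stages (text-to-semantic,
# semantic-to-coarse, coarse-to-fine) on first use.
preload_models()

# history_prompt conditions the cascade on token prompts from a
# reference speaker; "v2/en_speaker_6" is a preset shipped with Bark.
audio = generate_audio(
    "Hello, this is a test of zero-shot speech synthesis.",
    history_prompt="v2/en_speaker_6",
)
write_wav("bark_sample.wav", SAMPLE_RATE, audio)
```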
Evaluation Methodology
The paper evaluates Bark's performance across five dimensions: speaking style, intelligibility, speaker consistency, prosody variation, and spontaneous behavior. The evaluations used a set of 10 speaker prompts for synthesis, with multi-speaker VITS, a conventional TTS system, serving as the baseline; a sketch of the resulting evaluation grid follows the list of text inputs below.
Text Inputs: Two distinct types of text inputs were used:
- Read Speech Text: Extracted from the LibriTTS corpus, representing text from audiobooks.
- Conversational Text: Drawn from the DailyDialog corpus, representing dialogue-based text.
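As a rough illustration of the experimental design, the sketch below enumerates the evaluation grid described above. The system identifiers, corpus file names, and the synthesize() stub are hypothetical placeholders, not the authors' code.

```python
# Hypothetical sketch of the evaluation grid: 2 systems x 2 text styles
# x 10 speaker prompts. Names and paths are illustrative placeholders.
from itertools import product

SPEAKER_PROMPTS = [f"speaker_{i:02d}" for i in range(10)]  # 10 prompts
TEXT_SOURCES = {
    "read": "libritts_sentences.txt",           # read-speech text (LibriTTS)
    "conversational": "dailydialog_turns.txt",  # dialogue text (DailyDialog)
}
SYSTEMS = ["bark", "vits_multispeaker"]         # system under test vs. baseline

def synthesize(system: str, text_file: str, prompt: str) -> str:
    """Hypothetical stand-in: render the audio and return its file path."""
    return f"out/{system}/{prompt}/{text_file}.wav"

for system, (style, text_file), prompt in product(
        SYSTEMS, TEXT_SOURCES.items(), SPEAKER_PROMPTS):
    wav_path = synthesize(system, text_file, prompt)  # one condition per run
```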
Key Evaluations
Intelligibility
To measure how clearly Bark articulates text, the team transcribed the synthesized speech with the Whisper ASR model and calculated word error rate (WER) against the input text. The results indicate that Bark struggles with intelligibility, showing higher WER across all speakers than VITS. Larger Bark models performed slightly better on this metric, suggesting that increased scale improves clarity.
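A minimal sketch of this kind of ASR-based intelligibility check, assuming the openai-whisper and jiwer packages; the Whisper variant and text normalization used in the paper may differ.

```python
# WER of a synthesized utterance against its input text, via Whisper ASR.
import whisper
from jiwer import wer

asr = whisper.load_model("base.en")  # the paper's exact variant may differ

def utterance_wer(reference_text: str, wav_path: str) -> float:
    """Transcribe synthesized audio and score it against the input text."""
    hypothesis = asr.transcribe(wav_path)["text"]
    # Light normalization so casing and punctuation don't inflate WER.
    normalize = lambda s: s.lower().replace(",", "").replace(".", "").strip()
    return wer(normalize(reference_text), normalize(hypothesis))

print(utterance_wer("Hello, this is a test of zero-shot speech synthesis.",
                    "bark_sample.wav"))
```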
Speaker Consistency
Speaker consistency was assessed with a speaker identification model (ECAPA-TDNN) used to compute speaker similarity scores. The results showed that Bark tends to drift from the conditioned speaker, sometimes sounding like other voices from its training set, whereas VITS maintained speaker identity far more consistently.
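Here is a sketch of how such a similarity score can be computed with SpeechBrain's pretrained ECAPA-TDNN verification model; the file paths are illustrative, and the paper's scoring setup may differ.

```python
# Speaker similarity between the prompt audio and a synthesized utterance,
# using SpeechBrain's pretrained ECAPA-TDNN verification model.
from speechbrain.inference.speaker import SpeakerRecognition

verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_ecapa",
)

# Cosine similarity between the two embeddings; low scores indicate
# drift away from the prompted voice.
score, same_speaker = verifier.verify_files("speaker_prompt.wav",
                                            "bark_sample.wav")
print(f"similarity={score.item():.3f}, same speaker: {bool(same_speaker)}")
```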
Prosodic Variation
Using measures such as fundamental frequency (f0) and speech rate, the study found that Bark generates more varied prosody than VITS. This ability to produce diverse prosodic expression suggests that Bark can deliver more natural, less monotonous speech.
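The sketch below shows one way to extract comparable prosody statistics, assuming librosa for pitch tracking; the exact statistics and thresholds in the paper may differ.

```python
# Per-utterance prosody statistics: f0 mean/spread and speaking rate.
import librosa
import numpy as np

def prosody_stats(wav_path: str, n_words: int):
    y, sr = librosa.load(wav_path, sr=16000)
    # Frame-wise f0 via the pYIN tracker; unvoiced frames come back as NaN.
    f0, _, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    voiced = f0[~np.isnan(f0)]
    f0_mean, f0_std = voiced.mean(), voiced.std()  # spread ~ prosodic variation
    words_per_second = n_words / (len(y) / sr)     # crude speech-rate proxy
    return f0_mean, f0_std, words_per_second
```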
Spontaneous Behavior
The research also highlighted Bark's ability to incorporate spontaneous elements such as fillers (e.g., "um", "uh") and variable pause durations in its synthesized speech. This spontaneous behavior is rarely observed in traditional models like VITS, which tend to produce more scripted and predictable outputs.
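One hedged way to quantify such behavior is sketched below: counting filler words in an ASR transcript and estimating pauses from silence gaps. Whether fillers survive transcription depends on the ASR model and its decoding settings, so this approximates rather than reproduces the paper's analysis.

```python
# Count fillers in a transcript and estimate pause durations from silence.
import re
import librosa

FILLERS = re.compile(r"\b(um|uh|erm|hmm)\b", re.IGNORECASE)

def spontaneity_stats(transcript: str, wav_path: str, top_db: float = 30.0):
    n_fillers = len(FILLERS.findall(transcript))
    y, sr = librosa.load(wav_path, sr=16000)
    # Non-silent intervals; the gaps between them approximate pauses.
    intervals = librosa.effects.split(y, top_db=top_db)
    pauses_sec = [
        (start - prev_end) / sr
        for (_, prev_end), (start, _) in zip(intervals[:-1], intervals[1:])
    ]
    return n_fillers, pauses_sec
```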
Listening Tests
Human listeners evaluated the naturalness and contextual suitability of the synthesized speech. Bark generally outperformed VITS in these subjective tests, though its advantage was more pronounced for read speech than for conversational contexts. Interestingly, providing Bark with prior utterances as prompts did not significantly improve its performance, raising questions about how well the model exploits context.
Discussion
While Bark shows promise with its natural prosody and spontaneous behavior, it falls short on robustness, particularly intelligibility and speaker consistency. This finding opens the door to strategies such as scaling to improve performance. The paper also underscores the variability among SLMs and the need for broader evaluations of emerging models.
Conclusion
The paper provides a comprehensive evaluation of the Bark SLM for TTS, illustrating its strengths in generating varied, natural-sounding speech while highlighting remaining challenges in intelligibility and speaker fidelity. These insights set a benchmark for future advances and evaluations in the field of speech LLMs. For those interested in digging deeper or testing the methods themselves, the researchers have made their evaluation code and the audio samples used in their tests publicly available.