Contextual information encoded by Bark’s prompt tokens

Determine whether the prompt tokens used to condition Bark, a three-stage discrete-token speech language model for text-to-speech synthesis, convey prosodic and semantic context beyond speaker identity, particularly when the prompt is taken from the prior utterance, and whether such contextual prompting leads the model to generate more meaningful speech.

Background

Bark is a discrete-token speech LLM that generates speech in three stages: text-to-semantic, semantic-to-coarse, and coarse-to-fine. To synthesize speech in a given speaker's voice, Bark is prompted at all three stages with tokens from that speaker, following the standard zero-shot conditioning approach used in speech LLMs.
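To make this concrete, the sketch below runs the three stages with the speaker prompt supplied at each one. It assumes the function names exposed by the public suno-ai/bark repository (generate_text_semantic, generate_coarse, generate_fine, codec_decode); these are assumptions about the released library, not the paper's code, and exact signatures may vary across versions.

```python
# Minimal sketch of Bark's three-stage pipeline, assuming the public
# suno-ai/bark generation API; signatures may differ between versions.
from bark.generation import (
    generate_text_semantic,  # stage 1: text -> semantic tokens
    generate_coarse,         # stage 2: semantic -> coarse codec tokens
    generate_fine,           # stage 3: coarse -> fine codec tokens
    codec_decode,            # EnCodec decoder: fine tokens -> waveform
)

text = "The weather turned out better than expected."
speaker = "v2/en_speaker_6"  # preset supplying prompt tokens at every stage

semantic = generate_text_semantic(text, history_prompt=speaker)
coarse = generate_coarse(semantic, history_prompt=speaker)
fine = generate_fine(coarse, history_prompt=speaker)
audio = codec_decode(fine)  # 24 kHz mono waveform as a float array
```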

The authors explicitly question whether these prompt tokens carry contextual information (e.g., prosody or semantics) in addition to speaker identity. They consider the case where the prompt tokens are drawn from the prior utterance and investigate whether Bark incorporates that context to produce more meaningful output. They test this by replacing the speaker prompt at the semantic-to-coarse stage with semantic tokens encoded from a prior utterance and evaluating the outcomes in listening tests, as sketched below.
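The paper does not spell out the implementation, so the following is one plausible realization of the probe: the speaker preset is kept at the text-to-semantic and coarse-to-fine stages, while the semantic-to-coarse stage is prompted with tokens encoded from the prior utterance. The helpers semantic_tokens_from_audio and coarse_tokens_from_audio are hypothetical stand-ins for whatever audio tokenizers produce the prompt; Bark itself does not ship a semantic encoder.

```python
import numpy as np
from bark.generation import (
    generate_text_semantic,
    generate_coarse,
    generate_fine,
    codec_decode,
)

def semantic_tokens_from_audio(wav: np.ndarray) -> np.ndarray:
    """Hypothetical: quantize a waveform into Bark's semantic token space
    (e.g., with a HuBERT-based tokenizer). Not part of Bark itself."""
    raise NotImplementedError

def coarse_tokens_from_audio(wav: np.ndarray) -> np.ndarray:
    """Hypothetical: encode a waveform into coarse EnCodec tokens."""
    raise NotImplementedError

speaker = "v2/en_speaker_6"                       # ordinary speaker preset
text = "So I decided to take the train instead."  # utterance to synthesize
prior_wav = np.zeros(48_000, dtype=np.float32)    # placeholder prior utterance

# Stage 1 (text-to-semantic) keeps the usual speaker prompt.
semantic = generate_text_semantic(text, history_prompt=speaker)

# Stage 2 (semantic-to-coarse): replace the speaker prompt with context
# from the prior utterance. The dict mirrors the key layout of Bark's
# .npz speaker presets; Bark checks that the semantic and coarse prompt
# lengths are consistent with its semantic-to-coarse frame-rate ratio.
# (Some Bark versions also require a "fine_prompt" key in this dict.)
context_prompt = {
    "semantic_prompt": semantic_tokens_from_audio(prior_wav),
    "coarse_prompt": coarse_tokens_from_audio(prior_wav),
}
coarse = generate_coarse(semantic, history_prompt=context_prompt)

# Stage 3 (coarse-to-fine) proceeds as usual; listening tests then compare
# this context-prompted output with speaker-prompted baselines.
fine = generate_fine(coarse, history_prompt=speaker)
audio = codec_decode(fine)
```

Keeping the speaker preset at the other two stages isolates the semantic-to-coarse prompt as the variable under test.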

Establishing whether prompt tokens encode contextual prosodic or semantic information is important for understanding the in-context learning and control capabilities of discrete-token speech LLMs for TTS. It also bears on whether supplying prior-utterance context through the prompt can reliably improve synthesis quality or contextual appropriateness.

References

However, it is not clear if the prompt tokens provide additional context besides speaker identity. For example, if the prompt tokens are from the prior utterance, does the model take into account the prosody or semantics in the prompt to generate more meaningful speech?

Wang et al., "Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model", arXiv:2405.09768, 16 May 2024, Section 2.1 (Model Architecture).