Contextual information encoded by Bark’s prompt tokens
Determine whether the prompt tokens used to condition Bark, a three-level discrete-token speech language model for text-to-speech synthesis, convey prosodic and semantic context beyond speaker identity, particularly when the prompt is taken from the prior utterance, and whether such contextual prompting leads the model to generate more meaningful speech.
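A minimal sketch of how such a probe could be set up with the open-source `bark` package is shown below. It assumes the package's `generate_audio(..., output_full=True)` and `save_as_prompt` helpers behave as in the public repository at the time of writing; the speaker preset, example sentences, and comparison criteria are illustrative, not part of the cited paper.

```python
# Sketch: does Bark's history prompt carry prosodic/semantic context beyond
# speaker identity? Assumes the suno-ai `bark` package is installed.
from scipy.io.wavfile import write as write_wav

from bark import SAMPLE_RATE, generate_audio, preload_models
from bark.api import save_as_prompt  # assumed helper for saving prompt tokens

preload_models()

prior_utterance = "I can't believe we actually won the championship!"  # illustrative
target_text = "We should celebrate tonight."                           # illustrative

# 1) Generate the prior utterance and keep its full token generation
#    (semantic, coarse, and fine codebook tokens) for reuse as a prompt.
full_generation, prior_audio = generate_audio(
    prior_utterance,
    history_prompt="v2/en_speaker_6",  # fixed speaker preset
    output_full=True,
)
save_as_prompt("prior_utterance.npz", full_generation)

# 2a) Condition the next sentence on the prior utterance's tokens.
contextual_audio = generate_audio(target_text, history_prompt="prior_utterance.npz")

# 2b) Baseline: condition only on the speaker preset (identity, no dialogue context).
baseline_audio = generate_audio(target_text, history_prompt="v2/en_speaker_6")

# Compare the two renditions (prosody, emphasis, perceived emotion) to judge
# whether the prompt tokens convey context beyond speaker identity.
write_wav("contextual.wav", SAMPLE_RATE, contextual_audio)
write_wav("baseline.wav", SAMPLE_RATE, baseline_audio)
```

Listening tests or prosodic measurements (e.g., pitch and energy contours) on the two outputs would then indicate whether conditioning on the prior utterance changes more than the voice.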
References
However, it is not clear if the prompt tokens provide additional context besides speaker identity. For example, if the prompt tokens are from the prior utterance, does the model take into account the prosody or semantics in the prompt to generate more meaningful speech?
— Wang et al., "Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model" (arXiv:2405.09768, 16 May 2024), Section 2.1 (Model Architecture)