This paper introduces a novel approach to improving text-to-speech (TTS) systems: a phoneme-level BERT model, referred to as PL-BERT. The main goal is to enhance the naturalness and expressiveness of synthesized speech, with a particular focus on prosody, that is, the intonation, stress, and rhythm of speech. Let's break down the key components and findings of this research.
Background and Importance
TTS systems are designed to convert written text into spoken words. A key challenge in TTS is capturing the natural rhythms and emotional nuances of human speech. Traditional models often need large datasets to learn effectively, and even then they may fall short in generating natural-sounding prosody, especially when faced with out-of-distribution (OOD) texts that differ from the training data. The paper aims to bridge this gap with a language model that makes predictions at the level of phonemes, the fundamental units of sound in speech.
Phoneme-Level BERT (PL-BERT) Model
- Phoneme vs. Grapheme: The model operates solely on phonemes rather than graphemes or larger units like words. Phonemes are the basic units of sound, whereas graphemes are the letters or letter combinations that represent those sounds in writing. Restricting the input to phonemes shrinks the vocabulary and simplifies the model, making it more efficient (a small grapheme-to-phoneme example follows this list).
- Pre-training Strategy: PL-BERT is trained similarly to a standard BERT model but adapted for phoneme inputs. It is pre-trained with two objectives: masked language modeling (MLM), predicting masked phoneme tokens, and phoneme-to-grapheme (P2G) prediction, predicting the graphemes those phonemes correspond to. This dual objective helps the model learn a rich representation of speech sounds together with their linguistic context (see the sketch after this list).
- Masking Strategy: The model uses whole-word masking for phonemes: all phonemes of a selected word are masked together, pushing the model to reconstruct entire words from context rather than guessing isolated phonemes.
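To make the phoneme/grapheme distinction concrete, here is a minimal sketch of grapheme-to-phoneme conversion, turning written text (graphemes) into the kind of phoneme string PL-BERT takes as input. The open-source `phonemizer` library with the espeak backend is one common choice, assumed here purely for illustration:

```python
# Sketch: grapheme-to-phoneme conversion. The tool choice and settings
# are assumptions for illustration, not prescribed by the paper.
from phonemizer import phonemize

text = "The quick brown fox."
phonemes = phonemize(
    text,
    language="en-us",
    backend="espeak",
    strip=True,
    preserve_punctuation=True,
    with_stress=True,
)
print(phonemes)  # an IPA-like phoneme string, e.g. starting with "ðə kwˈɪk ..."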
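And to make the two pre-training objectives and the whole-word masking scheme concrete, here is a minimal PyTorch sketch: a transformer encoder over phoneme tokens with two heads, one predicting masked phonemes (MLM) and one predicting the corresponding graphemes (P2G). All names, sizes, masking rate, and the placeholder grapheme labels are illustrative assumptions, not the authors' implementation:

```python
import random

import torch
import torch.nn as nn
import torch.nn.functional as F


class PLBERTSketch(nn.Module):
    def __init__(self, n_phonemes, n_graphemes, d_model=256, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mlm_head = nn.Linear(d_model, n_phonemes)   # masked phoneme prediction
        self.p2g_head = nn.Linear(d_model, n_graphemes)  # phoneme-to-grapheme prediction

    def forward(self, phoneme_ids):
        h = self.encoder(self.embed(phoneme_ids))
        return self.mlm_head(h), self.p2g_head(h)


def whole_word_mask(phoneme_ids, word_ids, mask_id, p=0.3):
    """Mask every phoneme of a randomly chosen subset of words.

    word_ids maps each phoneme position to the index of the word it
    belongs to, so a chosen word is masked in its entirety.
    """
    masked = phoneme_ids.clone()
    chosen = {w for w in sorted(set(word_ids.tolist())) if random.random() < p}
    for i, w in enumerate(word_ids.tolist()):
        if w in chosen:
            masked[i] = mask_id
    return masked


random.seed(0)  # reproducible demo; guarantees at least one word is masked here
phonemes = torch.randint(0, 100, (12,))                        # 12 phoneme tokens
word_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3])  # spanning 4 words
masked = whole_word_mask(phonemes, word_ids, mask_id=100)

model = PLBERTSketch(n_phonemes=101, n_graphemes=500)  # 101 = 100 phonemes + mask token
mlm_logits, p2g_logits = model(masked.unsqueeze(0))    # add a batch dimension

# MLM loss on masked positions only; for P2G we use placeholder grapheme
# labels over all positions, glossing over the paper's exact target scheme.
is_masked = masked != phonemes
mlm_loss = F.cross_entropy(mlm_logits[0][is_masked], phonemes[is_masked])
grapheme_targets = torch.randint(0, 500, (12,))
p2g_loss = F.cross_entropy(p2g_logits[0], grapheme_targets)
loss = mlm_loss + p2g_loss
```

Summing the two cross-entropy terms reflects the dual objective described above; the paper's actual encoder, vocabulary sizes, and loss weighting may differ.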
Experiments and Results
- Datasets: The model was pre-trained on a large corpus of English text and then fine-tuned on LJSpeech, a single-speaker dataset commonly used for TTS.
- Performance: PL-BERT delivered significant improvements in prosody and naturalness over existing models, particularly on unseen, out-of-distribution texts. It outperformed Mixed-Phoneme BERT (MP-BERT), which combines phoneme inputs with sup-phonemes (sub-word units spanning several phonemes).
- Evaluation Metrics: Naturalness was evaluated with Mean Opinion Scores (MOS); PL-BERT achieved a higher MOS than baseline models such as StyleTTS without BERT (a small example of how a MOS is computed follows this list).
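As a concrete footnote on the metric: a MOS is simply the mean of listener ratings on a 1-to-5 naturalness scale, usually reported with a 95% confidence interval. A tiny sketch with made-up placeholder ratings (not data from the paper):

```python
import statistics

ratings = [4, 5, 4, 4, 3, 5, 4, 4, 5, 4]  # hypothetical per-listener scores
n = len(ratings)
mos = statistics.mean(ratings)
ci95 = 1.96 * statistics.stdev(ratings) / n ** 0.5
print(f"MOS = {mos:.2f} +/- {ci95:.2f}")  # MOS = 4.20 +/- 0.39
```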
Conclusion
PL-BERT enhances TTS systems by modeling language at the phoneme level, resulting in more natural and prosodically rich speech synthesis, especially for texts that were not part of the training distribution. This approach eliminates the need for large grapheme representations, which reduces computational complexity, speeds up training and inference, and addresses limitations of previous models.
Future Directions
The research suggests that future work should focus on adapting TTS models to handle OOD texts more robustly, since real-world applications routinely demand this flexibility. Reducing in-distribution bias remains a priority for further development of TTS systems.
Through these advancements, the paper makes a significant contribution to the field of TTS, highlighting the potential of phoneme-level processing in improving the naturalness and expressiveness of machine-generated speech.