PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS (2103.15060v3)

Published 28 Mar 2021 in cs.CL, cs.SD, and eess.AS

Abstract: This paper introduces PnG BERT, a new encoder model for neural TTS. This model is augmented from the original BERT model, by taking both phoneme and grapheme representations of text as input, as well as the word-level alignment between them. It can be pre-trained on a large text corpus in a self-supervised manner, and fine-tuned in a TTS task. Experimental results show that a neural TTS model using a pre-trained PnG BERT as its encoder yields more natural prosody and more accurate pronunciation than a baseline model using only phoneme input with no pre-training. Subjective side-by-side preference evaluations show that raters have no statistically significant preference between the speech synthesized using a PnG BERT and ground truth recordings from professional speakers.

Authors (5)
  1. Ye Jia (33 papers)
  2. Heiga Zen (36 papers)
  3. Jonathan Shen (13 papers)
  4. Yu Zhang (1400 papers)
  5. Yonghui Wu (115 papers)
Citations (75)

Summary

Overview of PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS

The paper "PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS" presents a novel approach to improving text-to-speech (TTS) systems by leveraging both phoneme and grapheme information through a specialized model called PnG BERT. The approach builds on the BERT architecture, adapting it to serve as a dual-input encoder for neural TTS. The primary contribution is an encoder that takes phonemes and graphemes jointly, is pre-trained on a large-scale text corpus, and yields more natural synthesized speech, particularly in terms of prosody and pronunciation accuracy.

Model Architecture and Methodology

PnG BERT diverges from standard BERT by accepting inputs in the form of both phonemes and graphemes, exploiting their complementary nature for richer language representations. This design addresses a limitation of phoneme-only models, which can suffer from representational ambiguity, especially when differentiating homophones. The dual-input setup enables more sophisticated text encoding via word-level alignment between the phoneme and grapheme sequences, a design inspired in part by cross-lingual language models such as XLM.
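
The paper itself does not include code; the sketch below illustrates, under stated assumptions, how such a dual-segment input could be assembled: phoneme tokens and grapheme tokens are concatenated as two segments, and tokens from both segments that belong to the same written word share a word index, which is the word-level alignment the model relies on. The helper callables `phonemize_word` and `grapheme_pieces` are hypothetical stand-ins for a grapheme-to-phoneme converter and a subword tokenizer, not part of the paper.

```python
# Hypothetical sketch (not the authors' code): build PnG BERT-style inputs.
# `phonemize_word` and `grapheme_pieces` stand in for a G2P converter and a
# subword tokenizer; the special-token layout mirrors ordinary BERT.

def build_png_bert_inputs(words, phonemize_word, grapheme_pieces):
    """Concatenate phoneme and grapheme tokens of one sentence into two
    segments, recording segment IDs and word-level positions so that both
    views of the same word share a word index (word-level alignment)."""
    tokens, segment_ids, word_positions = ["[CLS]"], [0], [0]

    # Segment A: phoneme tokens, one or more per word.
    for w_idx, word in enumerate(words, start=1):
        for phoneme in phonemize_word(word):
            tokens.append(phoneme)
            segment_ids.append(0)
            word_positions.append(w_idx)
    tokens.append("[SEP]"); segment_ids.append(0); word_positions.append(0)

    # Segment B: grapheme tokens (e.g. subword pieces of the written word).
    for w_idx, word in enumerate(words, start=1):
        for piece in grapheme_pieces(word):
            tokens.append(piece)
            segment_ids.append(1)
            word_positions.append(w_idx)
    tokens.append("[SEP]"); segment_ids.append(1); word_positions.append(0)

    token_positions = list(range(len(tokens)))  # standard BERT position IDs
    return tokens, segment_ids, token_positions, word_positions
```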

The model undergoes a pre-training phase on a large text corpus with a masked language modeling (MLM) objective. A key element is the consistent masking applied during pre-training: masking decisions are made at the word level and applied jointly to a word's phoneme and grapheme tokens, so the model cannot trivially recover a masked token from its unmasked counterpart in the other segment, which would otherwise simplify the MLM task. This design ensures that the phoneme-grapheme relationship is fully exploited, strengthening the model's ability to capture phonetic and textual information concurrently.
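
As an illustration only, the following simplified sketch draws the masking decision once per word and reuses it for that word's tokens in both segments. The paper's actual procedure distinguishes more masking cases (for example, masking only one segment of a word) and uses its own probabilities; the 15% selection rate and 80/10/10 split below are simply the familiar BERT defaults.

```python
import random

# Simplified, hypothetical sketch of consistent whole-word masking: the
# mask/keep decision is drawn once per word and applied to that word's tokens
# in BOTH the phoneme and grapheme segments. The paper's procedure has more
# cases and different probabilities; the numbers here are standard BERT values.

def consistent_whole_word_mask(tokens, word_positions, vocab,
                               select_rate=0.15, mask_token="[MASK]"):
    word_ids = {w for w in word_positions if w > 0}      # 0 marks special tokens
    selected = {w for w in word_ids if random.random() < select_rate}
    action = {w: random.random() for w in selected}      # one draw per word

    masked, labels = list(tokens), [None] * len(tokens)
    for i, (tok, w) in enumerate(zip(tokens, word_positions)):
        if w not in selected:
            continue
        labels[i] = tok                                   # MLM prediction target
        if action[w] < 0.8:
            masked[i] = mask_token                        # replace with [MASK]
        elif action[w] < 0.9:
            masked[i] = random.choice(vocab)              # random replacement
        # else: keep the original token unchanged
    return masked, labels
```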

Experimental Evaluations and Results

Through extensive experiments, PnG BERT demonstrated substantial improvements over a conventional phoneme-only TTS encoder. When incorporated into a neural TTS system, specifically NAT (Non-Attentive Tacotron), PnG BERT exhibited superior prosody naturalness and pronunciation accuracy compared to the baseline. The synthesized speech received mean opinion scores (MOS) statistically on par with professional human recordings, and in side-by-side preference tests against those recordings, listeners showed no statistically significant preference, indicating a high degree of synthesis quality.

The paper outlines several ablation studies highlighting the importance of different components of the pre-training strategy. Models utilizing word-level alignment and consistent masking outperformed those without, underscoring the efficacy of these methods in enhancing language representation capabilities. These findings suggest that the benefits of PnG BERT primarily stem from improved natural language understanding rather than just phonetic conversion enhancements.

Implications and Future Directions

The insights from this research suggest that TTS systems can be substantially enhanced by integrating pre-trained language models that exploit both phonetic and textual representations of the input. The approach not only improves the quality of synthesized speech but also demonstrates how a mainstream language model such as BERT can be adapted for specialized speech synthesis applications.

Future research can build upon these findings to explore multilingual TTS, potentially extending PnG BERT's methodology with additional language-specific phonetic and graphemic data. Moreover, integrating the model into architectures beyond NAT could reveal further adaptability and performance gains. As speech and NLP methodologies continue to evolve, models like PnG BERT are poised to shape the future of high-fidelity, natural-sounding TTS systems.
