- The paper presents a fully parallel TTS architecture that integrates explicit pitch prediction to enhance the expressiveness of synthesized speech.
- It pairs feed-forward Transformer stacks with lightweight 1-D CNN heads that predict a pitch and a duration value for every input token.
- Empirical results show mel-spectrogram synthesis more than 900x faster than real time on GPU and over 100x on CPU, making the model efficient for interactive TTS applications.
Examination of FastPitch: A Parallel Text-to-Speech Model with Pitch Prediction
The paper "FastPitch: Parallel Text-to-Speech with Pitch Prediction" introduces an innovative approach to neural text-to-speech (TTS) synthesis, proposing a model that extends upon the FastSpeech architecture by incorporating fundamental frequency (F0) contours into the synthesis process. This enhancement allows the model to predict pitch contours during inference, granting the ability to manipulate these contours, thereby augmenting the expressiveness and semantic alignment of generated speech.
Model Architecture and Methodology
FastPitch diverges from autoregressive TTS models by using a fully parallel structure built from the feed-forward Transformer (FFTr) blocks that underpin FastSpeech. The model employs two FFTr stacks: one operating at the resolution of input tokens and another at the resolution of output frames. Pitch and duration are each predicted per input token by a small 1-D CNN head, with the pitch predictor standing as the central innovation. The predicted pitch is projected to the dimensionality of the token hidden representations and added to them before upsampling, which is what enables real-time adjustments and interactive pitch modulation.
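The following is a minimal, hedged sketch of that data flow in PyTorch. The layer sizes, kernel widths, and the use of `nn.TransformerEncoder` in place of the paper's exact FFTr blocks are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class FastPitchSketch(nn.Module):
    """Minimal sketch of the FastPitch data flow; hyperparameters are
    illustrative and nn.TransformerEncoder stands in for the FFTr stacks."""

    def __init__(self, n_symbols=148, d_model=256, n_mel=80):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, d_model)
        enc = nn.TransformerEncoderLayer(d_model, nhead=2, batch_first=True)
        dec = nn.TransformerEncoderLayer(d_model, nhead=2, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=4)  # token-level FFTr
        self.decoder = nn.TransformerEncoder(dec, num_layers=4)  # frame-level FFTr
        # 1-D CNN heads predicting one pitch / duration value per token.
        self.pitch_pred = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(d_model, 1, kernel_size=3, padding=1))
        self.dur_pred = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(d_model, 1, kernel_size=3, padding=1))
        # Projects the scalar pitch into the hidden space so it can be
        # added to the token representations.
        self.pitch_embed = nn.Conv1d(1, d_model, kernel_size=3, padding=1)
        self.to_mel = nn.Linear(d_model, n_mel)

    def forward(self, tokens):                       # tokens: [1, n_tokens]
        h = self.encoder(self.embed(tokens))         # [1, T, d_model]
        pitch = self.pitch_pred(h.transpose(1, 2))   # [1, 1, T]
        dur = self.dur_pred(h.transpose(1, 2))       # [1, 1, T], log-durations
        h = h + self.pitch_embed(pitch).transpose(1, 2)
        # Discretize durations and upsample each token to its frame count.
        repeats = torch.clamp(torch.round(torch.exp(dur)).long(), min=1)
        h = torch.repeat_interleave(h[0], repeats[0, 0], dim=0).unsqueeze(0)
        return self.to_mel(self.decoder(h)), pitch, dur
```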
During training, FastPitch operates on either graphemes or phonemes, conditioning on ground-truth pitch and duration values while optimizing mean-squared error (MSE) losses for both quantities. A key advantage highlighted by the authors is resilience to alignment inaccuracies: substituting different alignment models does not significantly affect the quality of the synthesized speech.
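A sketch of the composite objective this describes, with MSE terms for pitch and duration added to the mel-spectrogram reconstruction loss; the loss weights `alpha` and `beta` and the log-duration target are assumptions, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def fastpitch_style_loss(mel_pred, mel_gt, pitch_pred, pitch_gt,
                         dur_pred, dur_gt, alpha=0.1, beta=0.1):
    """MSE on mels plus MSE on per-token pitch and log-durations.
    Ground-truth pitch/durations come from an F0 extractor and an aligner."""
    mel_loss = F.mse_loss(mel_pred, mel_gt)
    pitch_loss = F.mse_loss(pitch_pred, pitch_gt)
    # Predicting durations in the log domain keeps the target range compact.
    dur_loss = F.mse_loss(dur_pred, torch.log(dur_gt.float() + 1.0))
    return mel_loss + alpha * pitch_loss + beta * dur_loss
```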
Empirical Results
FastPitch attains a mean opinion score (MOS) on par with strong autoregressive models, indicating high-quality synthesis. Notably, it performs mel-spectrogram synthesis with a real-time factor (RTF) exceeding 900x on GPU and over 100x on CPU, which makes it well suited to latency-sensitive applications.
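For context, the RTF quoted here is the ratio of speech duration represented by the output to the wall-clock time spent generating it. One simple way to measure it; `model_infer`, the hop length, and the sample rate are assumptions for illustration.

```python
import time

def mel_rtf(model_infer, tokens, hop_length=256, sample_rate=22050):
    """RTF = seconds of speech covered by the generated mel-spectrogram
    per second of wall-clock synthesis time; higher is faster."""
    start = time.perf_counter()
    mel = model_infer(tokens)              # expected shape [n_mel, n_frames]
    elapsed = time.perf_counter() - start
    audio_seconds = mel.shape[-1] * hop_length / sample_rate
    return audio_seconds / elapsed         # e.g. 900.0 => 900x real time
```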
The paper further shows that FastPitch scales to multi-speaker scenarios, generating speech from additional speaker data at a quality comparable to, and in some cases surpassing, that of Tacotron 2 and Flowtron.
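One common mechanism for this kind of multi-speaker conditioning is a learned per-speaker embedding summed into the token representations; whether FastPitch uses exactly this scheme is an assumption of the sketch below.

```python
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    """Adds a learned speaker embedding to the encoder hidden states.
    An assumed conditioning scheme, not necessarily the paper's exact one."""
    def __init__(self, n_speakers, d_model=256):
        super().__init__()
        self.speaker_embed = nn.Embedding(n_speakers, d_model)

    def forward(self, token_hidden, speaker_id):
        # token_hidden: [batch, n_tokens, d_model]; speaker_id: [batch]
        return token_hidden + self.speaker_embed(speaker_id).unsqueeze(1)
```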
Implications and Future Directions
The ability of FastPitch to manipulate pitch contours opens numerous possibilities in practical applications, where altering perceptual qualities such as pitch, expressiveness, and speaker characteristics can enhance user interaction in digital content production, assistive technologies, and other domains. Moreover, the model's efficiency, combined with its expressiveness, presents a formidable tool for developers and researchers working on interactive TTS systems.
Theoretically, this research enriches the field's understanding of how conditioning on prosodic features such as pitch can improve the convergence behavior and overall quality of speech synthesis models. It offers a simple yet effective means of leveraging pitch information, paving the way for future work to explore deeper integration of prosodic elements, potentially including higher-order speech features or more complex linguistic content.
In conclusion, FastPitch stands as a robust contribution to the TTS domain, offering substantial advancements in synthesis quality and speed, with the added benefit of real-time prosodic control. Looking ahead, accommodating additional linguistic features or offering finer-grained pitch control could further broaden its application scope and efficacy.