Efficient and Conversational Speech Generation
This paper addresses the challenge of building efficient, conversational Text-to-Speech (TTS) systems, proposing a new model series that synthesizes natural, human-like speech in real time with significant improvements in data efficiency and inference speed. Current state-of-the-art TTS models such as VALL-E and SoundStorm rely on large neural architectures and large-scale datasets, making them impractical for real-time applications like assistive conversational systems. The model introduced in this work offers a compact yet high-performing alternative, achieving comparable audio quality with more than ten times less training data.
Key Contributions
- Compact and High-Performance Models: The proposed approach streamlines the TTS architecture while maintaining high performance, training on much smaller datasets of conversational speech and thereby significantly reducing data demands.
- Parallel Speech Generation: The new model uses non-autoregressive parallel decoding, inspired by MaskGIT-style inference, to speed up synthesis without compromising audio quality. This parallel approach contrasts with the autoregressive generation of previous models and dramatically reduces latency (see the decoding sketch after this list).
- Teacher-Student Distillation for Voice Quality Improvement: A teacher-student distillation setup with larger models improves naturalness and voice quality even when training data is limited. This data-efficient method allows the model to be specialized to single-speaker scenarios using synthetic data generated by third-party providers (see the distillation sketch after this list).
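
To make the parallel decoding strategy concrete, below is a minimal sketch of MaskGIT-style iterative decoding: all acoustic tokens start masked, the model predicts every position in parallel, the most confident predictions are kept, and the remaining positions are re-masked on a shrinking schedule. The `model` callable, its signature, and the cosine schedule are illustrative assumptions, not the paper's actual implementation.

```python
import torch

@torch.no_grad()
def maskgit_decode(model, cond, seq_len, num_steps=8, mask_id=0):
    """Sketch of MaskGIT-style iterative parallel decoding.

    `model(tokens, cond)` is assumed to return logits of shape
    (seq_len, vocab_size); names and signatures are hypothetical.
    """
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    known = torch.zeros(seq_len, dtype=torch.bool)

    for step in range(num_steps):
        logits = model(tokens, cond)                      # predict all positions in parallel
        confidence, candidates = logits.softmax(dim=-1).max(dim=-1)

        # Cosine schedule: fraction of positions that remain masked after this step.
        mask_ratio = torch.cos(torch.tensor((step + 1) / num_steps) * torch.pi / 2)
        num_to_keep = seq_len - int(mask_ratio.item() * seq_len)

        # Keep already-fixed tokens plus the most confident new predictions.
        confidence = confidence.masked_fill(known, float("inf"))
        keep = confidence.topk(num_to_keep).indices
        tokens[keep] = torch.where(known[keep], tokens[keep], candidates[keep])
        known[keep] = True
        tokens[~known] = mask_id                          # re-mask low-confidence positions

    return tokens
```

Because every step predicts all positions at once, the number of model calls is fixed by `num_steps` rather than by the sequence length, which is what drives the latency reduction over autoregressive decoding.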
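The distillation described above can operate at the data level: a larger teacher model (or a third-party TTS service) synthesizes speech for a target speaker, and the compact student is fine-tuned on those synthetic pairs. The sketch below assumes the student predicts discrete acoustic tokens and is trained with ordinary cross-entropy; all names and call signatures are hypothetical, not the paper's API.

```python
import torch.nn.functional as F

def distill_on_synthetic_data(student, tokenizer, synthetic_pairs, optimizer):
    """Sketch of data-level teacher-student distillation.

    `synthetic_pairs` is assumed to be an iterable of (text, acoustic_tokens)
    produced by a larger teacher model or a third-party TTS provider; the
    student is fine-tuned on these pairs to specialize to a single speaker.
    """
    student.train()
    for text, target_tokens in synthetic_pairs:
        text_ids = tokenizer(text)                     # hypothetical text front-end
        logits = student(text_ids)                     # (seq_audio, vocab)
        loss = F.cross_entropy(logits, target_tokens)  # match teacher-generated tokens
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```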
Evaluation and Results
The model is evaluated on various aspects of TTS, including richness and naturalness of prosody, intelligibility of speech, and inference efficiency:
- Word Error Rate (WER): The proposed model achieves a WER of 12.4%, demonstrating better intelligibility than comparable models such as MQTTS (WER 14.2%). A worked sketch of the WER computation follows this list.
- Speaker Similarity Score (SSS): Despite a lower SSS than MQTTS (0.594 versus 0.682), the model remains competitive and preserves essential speaker characteristics.
- Mel-Cepstral Distortion (MCD): The model achieves an MCD of 8.838 (lower is better), indicating good reconstruction fidelity.
- Fréchet Inception Distance (FID): An FID of 20.349 indicates strong generation diversity and naturalness.
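
For reference, WER is the word-level edit distance between a transcript of the synthesized speech and the input text, normalized by the number of reference words; a WER of 12.4% therefore corresponds to roughly one word error per eight reference words. The following is a minimal, generic sketch of that computation, not the paper's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming (Levenshtein) edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```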
In terms of efficiency, the model achieves a Real-Time Factor (RTF) of 0.133 for sentence-level synthesis, substantially reducing the processing time required by prior systems such as MQTTS.
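
RTF is the ratio of synthesis time to the duration of the generated audio, so an RTF of 0.133 means the system produces speech roughly 7.5 times faster than it plays back. The snippet below is a generic way to measure it; the `synthesize` callable and the sample rate are illustrative assumptions.

```python
import time

def real_time_factor(synthesize, text, sample_rate=24_000):
    """Measure RTF = synthesis time / audio duration (values below 1.0 are faster than real time)."""
    start = time.perf_counter()
    waveform = synthesize(text)                 # assumed to return a 1-D array of samples
    elapsed = time.perf_counter() - start
    audio_seconds = len(waveform) / sample_rate
    return elapsed / audio_seconds
```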
Implications and Future Directions
The implications of this work are significant for both academic and industrial settings. In practice, the model can be applied to real-time conversational systems such as voice assistants, where user satisfaction depends heavily on the naturalness and immediacy of interactions. The theoretical contribution is evidence that robust, high-fidelity TTS systems can be built with compact, minimalist designs.
Future exploration could focus on refining the T2S components to reduce computational loads further or applying newer efficient parallel decoding strategies. Incorporating multilingual capabilities or synthesizing longer utterances could address broader application needs. Moreover, expanding publicly available high-quality conversational datasets would bolster research in this area, enabling further innovation in TTS technology.
By providing a high-performing, data-efficient TTS framework, this research sets a new precedent for future developments in speech synthesis, challenging the paradigm of resource-intensive model design.