- The paper introduces SPEAR-TTS, decoupling text-to-speech into semantic and acoustic token tasks to achieve high fidelity with minimal supervision.
- It leverages large-scale unlabeled audio along with pretraining and backtranslation, significantly reducing the need for extensive parallel datasets.
- Experimental results show a CER of 1.92% and a MOS of 4.96, demonstrating its potential for scalable TTS, particularly for low-resource languages.
Overview of SPEAR-TTS: High-Fidelity Text-to-Speech with Minimal Supervision
The paper introduces SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that achieves high fidelity and voice diversity while requiring only minimal supervision during training. It does so by decoupling TTS into two sequence-to-sequence tasks that operate on different types of discrete speech representations, which substantially reduces the reliance on large parallel transcribed datasets.
Key Methodological Advances
The authors propose to model TTS as a composition of two main tasks:
- Translation from text to high-level semantic tokens, akin to "reading."
- Translation from semantic tokens to low-level acoustic tokens, akin to "speaking."
This decoupling allows the "speaking" module to be trained independently on large-scale, audio-only datasets, shifting the burden of covering diverse speakers from transcribed data to far more abundant untranscribed audio. The "reading" component, in turn, relies on pretraining and backtranslation to reduce the need for a large parallel corpus.
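To make the decomposition concrete, here is a minimal sketch of how the two stages could be composed at inference time. All object and method names (`reading_model`, `speaking_model`, `codec_decoder`, `.generate()`, `.decode()`) are illustrative placeholders, not the authors' actual API.

```python
# Minimal sketch of the two-stage SPEAR-TTS-style pipeline.
# All names below are illustrative placeholders (assumptions), not a real API.

def synthesize(text, reading_model, speaking_model, codec_decoder):
    """Compose "reading" (text -> semantic tokens) with "speaking"
    (semantic tokens -> acoustic tokens), then decode to a waveform."""
    # Stage 1 ("reading"): seq2seq model trained with limited parallel data,
    # helped by pretraining and backtranslation.
    semantic_tokens = reading_model.generate(text)

    # Stage 2 ("speaking"): seq2seq model trained on audio-only data, since
    # both its inputs and targets can be derived from untranscribed audio.
    acoustic_tokens = speaking_model.generate(semantic_tokens)

    # A neural-codec decoder (SoundStream-style) maps acoustic tokens to audio.
    return codec_decoder.decode(acoustic_tokens)
```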
Novelty and Implementation
SPEAR-TTS utilizes two distinct types of discrete speech tokens (a minimal tokenization sketch follows this list):
- Semantic tokens: Derived from a self-supervised speech model (e.g., w2v-BERT), these capture linguistic content while filtering out paralinguistic factors such as speaker identity.
- Acoustic tokens: Derived from a SoundStream neural codec, these provide a high-fidelity representation suitable for audio waveform synthesis.
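A rough sketch of how these two tokenizers could be obtained is shown below. The feature extractor and codec encoder are stand-ins (the real models, w2v-BERT and SoundStream, are not reimplemented here), and the cluster count is an illustrative assumption; only the k-means discretization step is actual runnable code.

```python
# Hedged sketch of the two tokenizers used by a SPEAR-TTS-style system.
# `w2v_bert_features` and `soundstream_encode` are placeholders for the real
# models; the number of clusters is an illustrative assumption.
import numpy as np
from sklearn.cluster import KMeans

def fit_semantic_tokenizer(feature_batches, n_clusters=512):
    """Semantic tokens: k-means over self-supervised speech features, so each
    frame is mapped to a discrete cluster id that mostly encodes content."""
    features = np.concatenate(feature_batches, axis=0)  # (num_frames, dim)
    return KMeans(n_clusters=n_clusters, n_init=10).fit(features)

def semantic_tokens(kmeans, features):
    return kmeans.predict(features)  # (num_frames,) integer token ids

# Acoustic tokens come directly from the neural codec's residual vector
# quantizer, e.g. soundstream_encode(waveform) -> (num_frames, n_quantizers)
# integer codes that the codec's decoder can turn back into a waveform.
```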
By employing this dual-token approach, the model casts TTS as a discrete sequence-to-sequence problem, allowing it to reuse methodological advances from machine translation and language modeling, such as Transformer architectures, BART/T5-style denoising pretraining, and backtranslation.
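As an illustration of how backtranslation can manufacture training data for the "reading" stage, the sketch below turns untranscribed audio (already mapped to semantic tokens) into synthetic (text, semantic-token) pairs. The model object and its `generate` method are assumptions made for the sake of the example.

```python
# Hedged sketch of backtranslation for the "reading" stage.
# `tokens_to_text_model` is a placeholder for a semantic-tokens -> text model
# fine-tuned on the small parallel corpus; its API is an assumption.

def backtranslate(audio_only_semantic_seqs, tokens_to_text_model):
    """Create synthetic (text, semantic-token) pairs from audio-only data."""
    synthetic_pairs = []
    for sem_tokens in audio_only_semantic_seqs:
        # Backward direction: semantic tokens -> synthetic transcript.
        text = tokens_to_text_model.generate(sem_tokens)
        # The pair is then used to train the *forward* direction
        # (text -> semantic tokens) alongside the real parallel data.
        synthetic_pairs.append((text, sem_tokens))
    return synthetic_pairs
```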
Moreover, to control speaker voice, SPEAR-TTS employs an example prompting mechanism analogous to techniques used in LLMs: it synthesizes speech in the voice of an unseen speaker from only a short (about 3-second) audio example, without speaker IDs or explicit speaker embeddings.
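The sketch below illustrates one plausible way such prompting can be arranged for the "speaking" stage: tokens extracted from the short prompt are placed before the target tokens so that generation continues in the prompt speaker's voice. The exact sequence layout and separator handling here are assumptions, not the paper's precise specification.

```python
# Illustrative sketch of example prompting for voice control.
# The exact sequence layout and separator token are assumptions.

def build_prompted_input(prompt_semantic, target_semantic, prompt_acoustic,
                         sep_token=-1):
    """Arrange tokens so the "speaking" model, when asked to continue the
    sequence, produces target acoustic tokens in the prompt speaker's voice."""
    # Condition on the prompt's semantic tokens followed by the target's.
    source = list(prompt_semantic) + [sep_token] + list(target_semantic)
    # Prime the output side with the prompt's acoustic tokens, which carry
    # the speaker's voice characteristics; generation continues from here.
    target_prefix = list(prompt_acoustic)
    return source, target_prefix
```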
Results and Implications
Experimental results show that SPEAR-TTS produces natural-sounding speech with few errors (CER of 1.92%) even when only 15 minutes of parallel data are used, comparing favorably against established baselines such as FastSpeech2-LR. Its ability to deliver high-fidelity speech (MOS of 4.96) in low-data regimes highlights its potential for low-resource languages and dialects where parallel datasets are scarce.
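For context on the intelligibility metric, CER is typically computed by transcribing the synthesized audio with an ASR system and measuring character-level edit distance against the input text; only that final metric step is sketched below, and the ASR transcription is assumed to have happened upstream.

```python
# Minimal CER sketch: character-level Levenshtein distance, normalized by the
# reference length. Assumes an external ASR system produced `hypothesis`.

def char_error_rate(reference: str, hypothesis: str) -> float:
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / max(len(reference), 1)

# Example: char_error_rate("hello world", "helo world") is roughly 0.09
```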
From a practical viewpoint, SPEAR-TTS implies a significant reduction in the cost of data collection for TTS systems, offering a scalable solution applicable to various linguistic settings. Its capacity for high-quality voice synthesis from an extremely limited dataset represents a striking departure from conventional TTS frameworks, which usually require extensive labeled data.
Future Prospects
The modular and efficient architecture of SPEAR-TTS lays a promising foundation for extending TTS to underrepresented languages. The methodology could be refined further through tighter integration with multilingual systems or more capable self-supervised learning paradigms. Future work might also focus on improving robustness to diverse acoustic conditions in the training audio, addressing the challenges of spontaneous speech synthesis, and refining speaker adaptation to handle more varied vocal characteristics.
In sum, SPEAR-TTS marks an important step toward making TTS more accessible, opening possibilities for linguistic inclusion and voice applications, and pointing the AI and speech-processing communities toward new directions in unsupervised and semi-supervised learning.