Natural language guidance of high-fidelity text-to-speech with synthetic annotations (2402.01912v1)

Published 2 Feb 2024 in cs.SD, cs.CL, and eess.AS

Abstract: Text-to-speech models trained on large-scale datasets have demonstrated impressive in-context learning capabilities and naturalness. However, control of speaker identity and style in these models typically requires conditioning on reference speech recordings, limiting creative applications. Alternatively, natural language prompting of speaker identity and style has demonstrated promising results and provides an intuitive method of control. However, reliance on human-labeled descriptions prevents scaling to large datasets. Our work bridges the gap between these two approaches. We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions. We then apply this method to a 45k hour dataset, which we use to train a speech LLM. Furthermore, we propose simple methods for increasing audio fidelity, significantly outperforming recent work despite relying entirely on found data. Our results demonstrate high-fidelity speech generation in a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions, all accomplished with a single model and intuitive natural language conditioning. Audio samples can be heard at https://text-description-to-speech.com/.

Authors (2)
  1. Dan Lyth (1 paper)
  2. Simon King (28 papers)
Citations (23)

Summary

  • The paper presents a novel approach enabling natural language control of TTS attributes without relying on reference recordings.
  • The methodology efficiently annotates a 45k-hour dataset with synthetic labels, achieving 94% gender classification accuracy and high-quality audio output.
  • Evaluation shows synthesized speech rated superior to ground truth audio, indicating transformative potential for scalable, natural TTS systems.

Introduction

The field of text-to-speech (TTS) synthesis has seen rapid advances thanks to the availability of large-scale datasets and models. Traditionally, control over speaker identity and style in TTS systems has required reference speech recordings, but there is growing interest in prompting these attributes with natural language instead. This paper introduces an approach that combines the strengths of both methods: it achieves natural language control of speaker identity and style without relying on reference samples and without the scalability limits imposed by human annotation.

Related Work

The researchers conduct a thorough analysis of existing methods that seek to control non-lexical speech information, such as identity and style. These efforts range from statistical measures applied to speech embeddings to more flexible, yet complex, latent variable modeling. Notably, several concurrent projects strive for natural language description in TTS but are limited either by the scope of their control or by the reliance on extensive human annotation to achieve desirable variation in speech.

Methodology

The proposed methodology emphasizes efficient and scalable labeling. The authors annotate a 45k-hour dataset with speech attributes including gender, accent, pitch, speaking rate, and recording conditions. A classifier and several computational measures are used in place of human labeling; the resulting attribute labels are then passed to an LLM, which rewrites them as natural-sounding descriptions. In addition, the paper introduces a high-fidelity speech synthesis method that leverages a state-of-the-art audio codec model to outperform current alternatives.
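
A minimal sketch of what one per-utterance annotation step might look like, assuming librosa for pitch extraction and a paired transcript for speaking rate; the thresholds, attribute bins, and helper names are illustrative assumptions, not the paper's exact pipeline.

```python
# Illustrative sketch only: bins and thresholds are assumptions, not the paper's values.
import numpy as np
import librosa

def annotate_utterance(wav_path: str, transcript: str) -> dict:
    """Derive coarse attribute keywords that a text LLM can later rewrite
    into a fluent natural-language description."""
    audio, sr = librosa.load(wav_path, sr=16000)
    duration = len(audio) / sr

    # Pitch: median F0 over voiced frames, binned into coarse categories.
    f0, voiced, _ = librosa.pyin(audio, fmin=60, fmax=400, sr=sr)
    median_f0 = float(np.nanmedian(f0)) if np.any(voiced) else 0.0
    if median_f0 < 120:
        pitch = "low-pitched"
    elif median_f0 > 220:
        pitch = "high-pitched"
    else:
        pitch = "moderate in pitch"

    # Speaking rate: words per second from the paired transcript.
    words_per_sec = len(transcript.split()) / max(duration, 1e-6)
    if words_per_sec < 2.0:
        rate = "slowly"
    elif words_per_sec > 3.5:
        rate = "quickly"
    else:
        rate = "at a moderate pace"

    # Gender, accent, and recording-condition tags would come from separate
    # classifiers and measures; the combined keywords are then handed to an
    # LLM, e.g. "Describe a speaker who is {pitch}, speaking {rate} ..."
    return {"pitch": pitch, "speaking_rate": rate, "duration_s": round(duration, 2)}

print(annotate_utterance("sample.wav", "the quick brown fox jumps over the lazy dog"))
```

Keyword-style labels like these are cheap to compute at 45k-hour scale, which is the point of replacing human annotation; the LLM rewriting step only has to turn them into varied, natural prose.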

Results and Evaluation

The paper presents thorough quantitative evaluations that show significant improvements in synthesis fidelity and naturalness. For instance, a gender classification accuracy of 94% is achieved purely through the model's learning, and although high-fidelity audio makes up only about 1% of the training data, the generated audio closely approaches professional recording quality. Notably, subjective listening tests rated the synthesized speech higher than the ground truth audio, possibly due to minor label errors or artifacts in the original recordings. The paper concludes with future directions aimed at broadening linguistic and stylistic diversity, suggesting wider application of the model's capabilities across languages and contexts.
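
As a rough illustration of how an attribute-consistency figure like the 94% above could be measured, the sketch below synthesizes speech from descriptions with a known target attribute, re-classifies the output, and reports agreement; `tts_model` and `gender_classifier` are hypothetical stand-ins, and this protocol is an assumption rather than the paper's exact evaluation setup.

```python
# Hedged sketch: synthesize from a description, re-classify, measure agreement.
def attribute_accuracy(tts_model, gender_classifier, eval_set):
    correct = 0
    for description, text, target_gender in eval_set:
        # Hypothetical interfaces; not components released with the paper.
        audio = tts_model.synthesize(text=text, description=description)
        if gender_classifier.predict(audio) == target_gender:
            correct += 1
    return correct / len(eval_set)

# Example usage with stand-in objects:
# acc = attribute_accuracy(model, clf,
#     [("A young woman speaking quickly in a quiet room.", "Hello there.", "female")])
```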
