- The paper introduces pGSLM, a prosody-aware generative spoken language model that integrates quantized prosodic features with phonetic modeling to improve expressive speech generation.
- It pairs a Multi-Stream Transformer Language Model (MS-TLM) with an adapted HiFi-GAN vocoder, demonstrating significant improvements in prosody and content modeling as measured by negative log-likelihood (NLL).
- The study offers practical insights for building inclusive, text-free speech systems, paving the way for more robust dialogue and multilingual applications.
Text-Free Prosody-Aware Generative Spoken Language Modeling
The paper "Text-Free Prosody-Aware Generative Spoken LLMing" presents a novel approach in the domain of spoken language processing, emphasizing the generative capabilities of speech models without relying on textual data. Traditional methods in NLP often involve converting speech to text via Automatic Speech Recognition (ASR) before processing. These methods have certain limitations due to the lack of text data for a majority of spoken languages and the loss of expressive features inherent to speech, such as prosody.
Overview
The existing framework, Generative Spoken Language Modeling (GSLM), falls short because it primarily captures phonetic content while neglecting the prosodic information crucial for generating expressive and coherent speech. Addressing this gap, the proposed prosody-aware Generative Spoken Language Model (pGSLM) integrates prosody with phonetic modeling, operating on discrete units discovered by a self-supervised speech model and representing prosodic features through quantized fundamental frequency (F0) and unit duration.
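As a rough illustration of this representation, the sketch below run-length encodes frame-level units into (unit, duration) segments and attaches a quantized per-segment log-F0 value. This is an assumed pipeline, not the paper's exact one; the bin count, equal-occupancy binning scheme, and unvoiced handling are placeholder choices.

```python
import numpy as np

def prosodic_streams(frame_units, frame_f0, n_f0_bins=32):
    """Run-length encode frame-level units into (unit, duration) segments
    and attach a quantized per-segment mean log-F0 value.

    frame_units: int array of discrete unit IDs, one per frame.
    frame_f0:    float array of F0 in Hz, one per frame (0 = unvoiced).
    """
    # Collapse consecutive repeats of the same unit into segments.
    boundaries = np.flatnonzero(np.diff(frame_units)) + 1
    segments = np.split(np.arange(len(frame_units)), boundaries)

    units, durs, logf0 = [], [], []
    for seg in segments:
        units.append(int(frame_units[seg[0]]))
        durs.append(len(seg))                      # duration in frames
        voiced = frame_f0[seg][frame_f0[seg] > 0]  # skip unvoiced frames
        logf0.append(np.log(voiced).mean() if len(voiced) else 0.0)

    # Quantize log-F0 into equal-occupancy bins (an assumed scheme);
    # this sketch assumes some voiced speech is present, and fully
    # unvoiced segments simply fall into the lowest bin.
    edges = np.quantile([v for v in logf0 if v > 0],
                        np.linspace(0, 1, n_f0_bins + 1)[1:-1])
    f0_bins = np.digitize(logf0, edges)
    return np.array(units), np.array(durs), f0_bins
```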
The pGSLM includes:
- Multi-Stream Transformer Language Model (MS-TLM): This model jointly represents the phonetic and prosodic streams and autoregressively predicts upcoming speech segments (a minimal architectural sketch follows this list).
- HiFi-GAN vocoder: An adapted HiFi-GAN converts the MS-TLM outputs back into speech waveforms.
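The following is a minimal sketch of the multi-stream idea, assuming a shared causal transformer backbone with summed stream embeddings and one prediction head per stream. The `MultiStreamTLM` class, vocabulary sizes, and layer counts are illustrative choices, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiStreamTLM(nn.Module):
    """Sketch of a multi-stream transformer LM: three time-aligned token
    streams (unit, duration bin, F0 bin) are embedded, summed, and passed
    through a causal transformer; separate heads predict each next token."""

    def __init__(self, n_units=100, n_dur_bins=32, n_f0_bins=32,
                 d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        self.emb_unit = nn.Embedding(n_units, d_model)
        self.emb_dur = nn.Embedding(n_dur_bins, d_model)
        self.emb_f0 = nn.Embedding(n_f0_bins, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head_unit = nn.Linear(d_model, n_units)
        self.head_dur = nn.Linear(d_model, n_dur_bins)
        self.head_f0 = nn.Linear(d_model, n_f0_bins)

    def forward(self, units, durs, f0s):  # each: (batch, time) long tensor
        x = self.emb_unit(units) + self.emb_dur(durs) + self.emb_f0(f0s)
        t = x.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool,
                                       device=x.device), diagonal=1)
        h = self.backbone(x, mask=causal)
        return self.head_unit(h), self.head_dur(h), self.head_f0(h)
```

The paper also examines continuous-valued prosody streams; this sketch shows only a fully quantized variant.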
Technical Insights and Results
The paper introduces metrics tailored to prosody modeling and reports considerable improvements in both prosody and content modeling once prosodic information is taken into account. Key results indicate:
- Including prosody improves phonetic content modeling: models that consume both phonetic and prosodic streams achieve lower Negative Log-Likelihood (NLL) than phonetic-only models.
- Prosodic input enhances generative tasks, enabling speech continuations that follow the prompt's prosodic cues (a decoding sketch appears after this list).
- Quantizing prosodic features lets the model handle multimodal distributions, which is crucial for generating expressive speech (see the toy example below).
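The toy example below (synthetic data, not drawn from the paper) illustrates why quantization matters: a regressor trained with a mean-seeking loss converges to the average of a bimodal F0 distribution, a value that rarely occurs, whereas a categorical model over quantized bins can keep probability mass on both modes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Bimodal F0-like data: two pitch modes (e.g., phrase-final fall vs. rise).
f0 = np.concatenate([rng.normal(110, 5, 500), rng.normal(220, 5, 500)])

# A regression model trained with MSE converges to the conditional mean,
# a value (~165 Hz) that almost never occurs in the data:
print("MSE-optimal point prediction:", f0.mean())

# A classifier over quantized bins can keep probability on both modes:
bins = np.quantile(f0, np.linspace(0, 1, 33))  # 32 equal-occupancy bins
hist, _ = np.histogram(f0, bins=bins)
probs = hist / hist.sum()
print("Top-2 bin probabilities:", np.sort(probs)[-2:])
```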
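A prosodic continuation can then be decoded autoregressively by sampling one (unit, duration, F0-bin) triple at a time and feeding it back in. The loop below builds on the hypothetical `MultiStreamTLM` sketch above and is an assumed decoding procedure, not the paper's exact sampling setup.

```python
import torch

@torch.no_grad()
def continue_prompt(model, units, durs, f0s, steps=50, temperature=1.0):
    """Sample `steps` new (unit, duration, F0-bin) triples after a prompt.
    Each input tensor has shape (1, T); `model` is the sketch above."""
    def sample(logits):
        probs = torch.softmax(logits[:, -1] / temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1)

    for _ in range(steps):
        logit_u, logit_d, logit_f = model(units, durs, f0s)
        units = torch.cat([units, sample(logit_u)], dim=1)
        durs = torch.cat([durs, sample(logit_d)], dim=1)
        f0s = torch.cat([f0s, sample(logit_f)], dim=1)
    return units, durs, f0s
```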
Implications
Practically, this approach broadens the scope of NLP and speech processing, allowing for the development of more inclusive dialogue systems and speech synthesis applications that can leverage prosodic features. With potential applications in automatic content generation and expressive speech synthesis, the paper underscores a shift from text-dependent models to robust text-free generative models that better mimic human speech.
Theoretically, integrating prosody into spoken language models paves the way for more nuanced understanding and generation of speech, promoting advances in self-supervised learning paradigms.
Future Directions
Future research might explore expressive features beyond prosody, improved emotion recognition, and deeper semantic comprehension of speech. Models that operate seamlessly across languages without large text corpora could significantly broaden AI's capabilities in speech-driven applications. Moreover, cross-lingual and conversational AI applications present opportunities to further refine and validate pGSLM in varied contexts.
In conclusion, this work challenges the conventional separation between text-based and speech-based language modeling by demonstrating the feasibility and advantages of a text-free, prosody-enhanced approach to generative spoken language modeling.