
Generative Spoken Language Modeling from Raw Audio (2102.01192v2)

Published 1 Feb 2021 in cs.CL

Abstract: We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text) all trained without supervision and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.

Authors (11)
  1. Kushal Lakhotia (15 papers)
  2. Evgeny Kharitonov (5 papers)
  3. Wei-Ning Hsu (76 papers)
  4. Yossi Adi (96 papers)
  5. Adam Polyak (29 papers)
  6. Benjamin Bolte (5 papers)
  7. Tu-Anh Nguyen (3 papers)
  8. Jade Copet (26 papers)
  9. Alexei Baevski (39 papers)
  10. Abdelrahman Mohamed (1 paper)
  11. Emmanuel Dupoux (81 papers)
Citations (312)

Summary

On Generative Spoken Language Modeling from Raw Audio: An Expert Overview

The paper "On Generative Spoken LLMing from Raw Audio" presents an innovative approach to developing LLMs without the traditional reliance on text or labeled data. This work introduces the task of Generative Spoken LLMing which aims to learn acoustic and linguistic characteristics purely from raw audio. The focus is on bridging the gap between speech processing and NLP by facilitating the development of systems that mimic the natural way humans learn languages from spoken input.

Core Contributions and Methodology

  1. System Architecture: The proposed solution is a pipeline of three components (see the first sketch after this list):
    • A speech-to-unit (S2u) model that encodes speech into discrete pseudo-text units.
    • A unit language model (uLM) that models the distribution of these pseudo-text units.
    • A unit-to-speech (u2S) model that decodes pseudo-text back into a waveform.
  2. Evaluation Framework: The authors introduce a novel set of evaluation metrics for assessing both encoding and generation capabilities at the acoustic and linguistic levels. For instance, acoustic evaluations use ABX scores for phonetic unit discrimination, while linguistic evaluations employ measures like spot-the-word accuracy for lexical knowledge (see the second sketch after this list).
  3. Comparative Analysis: Across multiple baselines and configurations, spanning three speech encoders (Contrastive Predictive Coding (CPC), wav2vec 2.0, and HuBERT), the analysis examines the impact of varying codebook sizes (50, 100, or 200 units) on performance.
  4. Human and Machine Metrics Correlation: Human judgments (a Mean Opinion Score for audio naturalness and a Meaningfulness MOS for content) are collected alongside machine metrics (such as Phone Error Rate computed with pretrained ASR models). The strong correlation between human scores and ASR-derived metrics substantiates the validity of the proposed evaluation framework.
  5. Public Resources: To support reproducibility and future advancements, all the evaluation tools and selected baseline models are made publicly available, encouraging transparency and comparability in subsequent research endeavors.
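
To make the pipeline in item 1 concrete, here is a minimal sketch of how the three stages compose. Everything in it is an illustrative assumption rather than the authors' released code: the nearest-codebook quantizer stands in for the k-means step applied to CPC/wav2vec 2.0/HuBERT features, `ulm.next_probs` stands in for the trained unit language model, and `vocoder` is a stub for the paper's unit-conditioned speech decoder.

```python
# A minimal sketch of the S2u -> uLM -> u2S pipeline. All names here
# (speech_to_units, continue_units, units_to_speech, ulm, vocoder) are
# hypothetical placeholders, not the paper's released implementation.
import numpy as np

def speech_to_units(features: np.ndarray, codebook: np.ndarray) -> list[int]:
    """S2u: assign each frame's feature vector to its nearest codebook entry
    (k-means style) and collapse consecutive repeats into pseudo-text."""
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    units = dists.argmin(axis=1).tolist()
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

def continue_units(prompt: list[int], ulm, max_new: int = 50,
                   rng: np.random.Generator | None = None) -> list[int]:
    """uLM: autoregressively extend a pseudo-text prompt. `ulm.next_probs`
    is assumed to return a probability distribution over the unit vocabulary."""
    rng = rng or np.random.default_rng()
    seq = list(prompt)
    for _ in range(max_new):
        probs = ulm.next_probs(seq)
        seq.append(int(rng.choice(len(probs), p=probs)))
    return seq

def units_to_speech(units: list[int], vocoder) -> np.ndarray:
    """u2S: synthesize a waveform from pseudo-text; `vocoder.synthesize`
    is a stub for the paper's speech decoder."""
    return vocoder.synthesize(units)
```

The repeat-collapsing step reflects the paper's treatment of runs of identical units as a single pseudo-text token before language modeling.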
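On the evaluation side (item 2), the lexical spot-the-word test admits an equally small sketch: the unit language model should assign a higher likelihood to the pseudo-text of a real word than to that of a matched nonword. The `log_likelihood` callable below is an assumed stand-in for scoring with the trained uLM, not an API from the paper's codebase.

```python
from typing import Callable, Sequence

def spot_the_word_accuracy(
    pairs: Sequence[tuple[list[int], list[int]]],
    log_likelihood: Callable[[list[int]], float],
) -> float:
    """Fraction of (real_word_units, nonword_units) pairs where the model
    scores the real word strictly higher than the matched nonword."""
    wins = sum(log_likelihood(real) > log_likelihood(fake)
               for real, fake in pairs)
    return wins / len(pairs)
```

Chance level is 50%, so accuracies above that indicate the uLM has absorbed lexical regularities from pseudo-text alone.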

Key Findings

  • Effect of Quantization: The paper highlights the dependency of system performance on the number of discrete units used to encode audio. While finer quantizations (200 units) yield better results in speech resynthesis, a balance must be struck for generative language tasks to account for linguistic abstraction without overfitting to phonetic detail.
  • Encoder and Task Dependency: The choice of speech encoder and the task at hand significantly affect performance. HuBERT performs notably well across several assessments, particularly in generating coherent speech, suggesting its efficacy for higher-order language modeling tasks.
  • Automatic vs. Human Evaluation: The introduced ASR-based metrics correlate strongly with human-evaluated scores, establishing these automatic methods as viable proxies for comprehensive model evaluation (see the sketch below).
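
As an illustration of the last point, the ASR-based proxy boils down to transcribing generated speech with a pretrained recognizer, computing an error rate against a reference, and checking its correlation with human scores. The sketch below uses the real `jiwer` and `scipy` packages; the phone strings and MOS values are fabricated placeholders purely to show the mechanics.

```python
from jiwer import wer             # pip install jiwer
from scipy.stats import pearsonr  # pip install scipy

# Placeholder phone transcripts: references (e.g. from forced alignment)
# versus hypotheses produced by a pretrained ASR model.
refs = ["h ax l ow w er l d", "g uh d m ao r n ih ng",
        "s iy y uw s uw n", "th ae ng k y uw"]
hyps = ["h ax l ow w er l d", "g ih d m ao n ih ng",
        "s iy y uw s uw m", "th ae ng k s y uw"]

# With each phone treated as a "word", WER over phone strings is the PER.
per_scores = [wer(r, h) for r, h in zip(refs, hyps)]

# Hypothetical human Mean Opinion Scores for the same utterances.
mos_scores = [4.6, 3.2, 4.1, 3.8]

# A strong negative correlation (lower PER alongside higher MOS) is the kind
# of evidence used to validate the automatic metric against human judgment.
r, p = pearsonr(per_scores, mos_scores)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```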

Implications and Future Directions

The ability to model and generate speech directly from raw audio, without intermediate text, is a significant step towards accommodating the many languages that lack extensive orthographic resources. By demonstrating the feasibility of "textless NLP," this research can lead to more inclusive AI technologies adept at handling diverse linguistic contexts.

Potential future developments include refining these models' ability to capture prosodic and contextual nuances beyond phonetic content, and exploring the transferability of these architectures to truly low-resource, unwritten languages. Further, training unit language models on growing datasets of natural spoken interactions, such as those from social media or voice-enabled AI platforms, could significantly enhance the robustness of generative spoken language modeling.

In summary, this paper lays substantial groundwork for future work on generative language modeling from an audio-centric perspective, aiming to democratize NLP technologies and bridge linguistic resource gaps worldwide.
