
Generative Spoken Language Modeling

Updated 17 February 2026
  • Generative Spoken Language Modeling is a textless approach that jointly learns acoustic, prosodic, and linguistic structures directly from raw speech data.
  • It employs a three-stage process—speech-to-unit, unit-level language modeling, and unit-to-speech reconstruction—using self-supervised encoders and neural vocoders.
  • Recent advances incorporate prosodic modeling, dual-channel dialogue architectures, and flow-based techniques to enhance robustness, coherence, and low-resource language processing.

Generative Spoken Language Modeling (GSLM) investigates the generation of speech directly from audio representations, eschewing text intermediates such as orthographic transcription or phonemic labels. This approach leverages self-supervised encoders, discrete or continuous tokenization, and neural autoregressive or flow-based models to jointly capture the acoustic, prosodic, and linguistic structure of natural spoken language. The field encompasses models for both monologue and dialogue, integrates paralinguistic and prosodic modeling, and addresses foundational challenges in representation learning, robustness, and long-form generation.

1. Core Principles and Problem Definition

Generative spoken language modeling is defined as the task of learning both the acoustic–phonetic and higher-level compositional (linguistic) structures of a language using only raw speech data (Lakhotia et al., 2021, Park et al., 2023). Unlike conventional approaches that require paired speech–text data or phoneme inventories, GSLM operates in a "textless" fashion: both analysis and synthesis use learned audio-derived units. The fundamental paradigm is a three-stage system:

  1. Speech-to-Unit (S2U): Self-supervised encoders (e.g., HuBERT, wav2vec 2.0, CPC) map audio to frame-level continuous features, which are then quantized into discrete units (typically via k-means).
  2. Unit-Level Language Model (uLM): A neural LM (Transformer or LSTM) autoregressively models sequences of these units, optionally conditioned on prompts or context.
  3. Unit-to-Speech (U2S): A neural vocoder or sequence decoder reconstructs waveforms from unit sequences.
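The three stages above can be sketched end to end. This is a toy illustration with synthetic features and placeholder models (a real system would use HuBERT or wav2vec 2.0 features, a trained Transformer uLM, and a neural vocoder such as HiFi-GAN); the codebook, shapes, and uniform sampler are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
N_UNITS, DIM = 100, 16                  # codebook size, feature dim (arbitrary)
codebook = rng.normal(size=(N_UNITS, DIM))

def speech_to_units(frames: np.ndarray) -> np.ndarray:
    """S2U: assign each frame-level feature to its nearest codebook unit."""
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)             # discrete unit IDs, shape (T,)

def unit_lm_continue(units: np.ndarray, steps: int) -> np.ndarray:
    """uLM: autoregressive continuation (placeholder: uniform sampler)."""
    out = list(units)
    for _ in range(steps):
        # a trained LM would condition on the full history `out`;
        # uniform sampling stands in for the learned conditional
        out.append(int(rng.integers(N_UNITS)))
    return np.array(out)

def units_to_speech(units: np.ndarray) -> np.ndarray:
    """U2S: map unit IDs back to (pseudo-)acoustic frames via the codebook."""
    return codebook[units]

frames = rng.normal(size=(50, DIM))     # stand-in for 50 frames of SSL features
units = speech_to_units(frames)
cont = unit_lm_continue(units, steps=10)
wav_frames = units_to_speech(cont)
print(units.shape, cont.shape, wav_frames.shape)
```

The separation into three modules is what makes the pipeline "textless": only the codebook and the uLM need to be learned, and neither ever sees orthography.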

The generative objective is typically maximization of the sequence likelihood $p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{<t})$, where $x_{1:T}$ denotes the speech token sequence, which may be discrete (unit IDs) or continuous (embeddings).
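The factorized likelihood can be made concrete with a toy model whose conditionals are given by a fixed bigram transition matrix (the matrix and sequence below are synthetic placeholders, not a trained model):

```python
import numpy as np

V = 4                                       # toy unit vocabulary size
rng = np.random.default_rng(1)
logits = rng.normal(size=(V, V))
P = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # row i: p(x_t | x_{t-1}=i)
prior = np.full(V, 1.0 / V)                 # p(x_1), uniform for simplicity

def sequence_log_likelihood(x):
    """log p(x_{1:T}) = log p(x_1) + sum_t log p(x_t | x_{t-1})."""
    ll = np.log(prior[x[0]])
    for prev, cur in zip(x[:-1], x[1:]):
        ll += np.log(P[prev, cur])
    return ll

x = [0, 2, 1, 1, 3]
ll = sequence_log_likelihood(x)
print(ll)   # negating this gives the standard LM training loss
```

A Transformer uLM replaces the bigram conditional with one that attends over the entire history, but the training objective is the same negative log-likelihood.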

Motivation includes universal spoken language processing (especially for unwritten or low-resource languages), bypassing the bias of written forms, and capturing full acoustic and paralinguistic richness (Lakhotia et al., 2021, Park et al., 2023).

2. Tokenization and Acoustic Representation

The performance and interpretability of generative SLMs depend critically on the token representation. Standard practice is to extract framewise features (20–40 ms hop) with self-supervised encoders and quantize them:

Discrete Units

  • Discrete unit systems (pseudo-phones) are obtained by clustering (k-means or learned quantizers) high-level embeddings. These "units" strongly correlate with phoneme identity but are not in 1-to-1 correspondence. Redundancy and lack of robustness to signal distortions are recurring issues, prompting the development of redundancy metrics (e.g., circular resynthesis) and robust quantization via augmentation-invariant training (Sicherman et al., 2023, Gat et al., 2022).
  • Codebook sizes of 50–200 are common, balancing acoustic detail and linguistic tractability (Lakhotia et al., 2021).
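Unit discovery by clustering can be sketched with hand-rolled Lloyd iterations over synthetic frame features (real pipelines would cluster actual HuBERT or CPC features, typically with faiss or scikit-learn; the feature matrix and K here are placeholders):

```python
import numpy as np

rng = np.random.default_rng(42)
features = rng.normal(size=(500, 8))        # stand-in for SSL frame features
K = 50                                      # codebook size in the common 50-200 range

# initialize centroids from random frames, then run a few Lloyd iterations
centroids = features[rng.choice(len(features), K, replace=False)]
for _ in range(10):
    d = ((features[:, None] - centroids[None]) ** 2).sum(-1)
    assign = d.argmin(1)
    for k in range(K):
        m = assign == k
        if m.any():
            centroids[k] = features[m].mean(0)

# final quantization: each frame becomes a discrete pseudo-phone ID
units = ((features[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
print(units[:10])
```

Robust-quantization variants keep the same nearest-centroid step but train the representation so that augmented copies of a frame map to the same unit.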

Continuous Embeddings and Word-Sized Tokens

  • Recent models introduce continuous, "word-sized" tokens. Segmentation into ∼200 ms intervals (using speech sequence embedding (SSE) models) allows each chunk to be mapped to a continuous embedding and then projected to a lexical space through quantization and a deep projector (Algayres et al., 2023).
  • This facilitates k-NN-based sampling rather than multinomial over fixed vocabularies. Lexical embeddings organize along semantic axes, yielding interpretable structure with significant memory savings.
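The k-NN sampling step can be sketched as follows: instead of a softmax over a fixed vocabulary, the model predicts a continuous vector and samples among its k nearest stored lexical embeddings. The lexicon, query, and distance-to-weight mapping below are illustrative assumptions, not the cited paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(7)
lexicon = rng.normal(size=(1000, 32))       # stored word-sized embeddings

def knn_sample(query: np.ndarray, k: int = 5, temp: float = 1.0) -> int:
    """Sample one of the k nearest lexicon entries, weighted by
    exp(-distance / temp), and return its index."""
    d = np.linalg.norm(lexicon - query, axis=1)
    nearest = np.argsort(d)[:k]
    w = np.exp(-d[nearest] / temp)
    return int(rng.choice(nearest, p=w / w.sum()))

query = rng.normal(size=32)                 # the model's predicted continuous token
idx = knn_sample(query)
print(idx)
```

Because sampling touches only k candidates rather than a full vocabulary softmax, the approach scales with the lexicon size only through the nearest-neighbor search.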

Prosodic and Paralinguistic Features

  • Explicit modeling of prosody (duration, pitch) is achieved via parallel streams and multi-stream architectures; prosodic features can be treated as quantized or continuous variables and jointly predicted alongside content units (Kharitonov et al., 2021).
  • Paralinguistic attributes (sentiment, emotion, style) are encoded as special tokens or learned continuous attributes. End-to-end variational approaches introduce latent continuous variables trained via VAE or normalizing flows, obviating the need for hand-crafted prosodic features (Chen et al., 17 Jun 2025, Lin et al., 2023).
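Multi-stream input construction in the spirit of MS-TLM can be illustrated by summing per-step embeddings for content unit, quantized duration, and quantized pitch into a single input vector. Dimensions, vocabulary sizes, and the toy sequences are placeholders:

```python
import numpy as np

rng = np.random.default_rng(3)
D = 64
E_unit = rng.normal(size=(100, D))          # content-unit embedding table
E_dur = rng.normal(size=(32, D))            # quantized-duration embedding table
E_f0 = rng.normal(size=(32, D))             # quantized-pitch embedding table

units = np.array([5, 17, 17, 42])           # toy content stream
durs = np.array([2, 4, 1, 3])               # toy duration bins
f0s = np.array([10, 11, 11, 9])             # toy pitch bins

# one summed vector per timestep feeds the Transformer; separate output
# heads then predict next unit, duration, and pitch jointly
x = E_unit[units] + E_dur[durs] + E_f0[f0s]
print(x.shape)
```

Keeping prosody in parallel streams rather than folding it into the unit inventory avoids a combinatorial blow-up of the codebook.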

3. Model Architectures and Multimodal Integration

Single-Channel Generative Models

  • Baseline GSLMs consist of an autoregressive transformer over unit or lexical embeddings, trained with cross-entropy or contrastive (InfoNCE) losses (Lakhotia et al., 2021, Algayres et al., 2023).
  • Prosody- and paralinguistics-aware LMs employ multi-stream transformers (MS-TLM) that sum learned embeddings for content units, duration, and pitch, and jointly minimize multi-headed losses (Kharitonov et al., 2021).
  • End-to-end variational models incorporate both semantic tokens and learned continuous paralinguistic features (jointly modeled via VAE, AR prior, and diffusion decoders), improving naturalness while preserving content (Chen et al., 17 Jun 2025).

Dialogue and Dual-Channel Models

  • Dialogue modeling advances include dual-tower transformers (dGSLM) with shared weights and cross-attention between two speakers' unit streams, allowing the modeling of turn-taking, overlaps, and paralinguistic events without explicit annotation (Nguyen et al., 2022).
  • Next-Token-Pair Prediction (NTPP) proposes a decoder-only architecture modeling joint distributions over both speakers' tokens, employing pairwise masking to control attention and enable speaker-independent, real-time dialogue generation. This setup offers state-of-the-art turn-taking and naturalness (Wang et al., 1 Jun 2025).
  • Pseudo-stereo data augmentation, transforming single-channel mixed dialogue into two-channel data via diarization, separation, and speaker verification, provides large-scale training data critical for semantically coherent generation (Fu et al., 2024).
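The pair-wise generation loop of NTPP can be sketched schematically: at each step the model emits a token for both channels, factorizing $p(a_t, b_t \mid \text{history}) = p(a_t \mid \text{history})\, p(b_t \mid \text{history}, a_t)$. The uniform sampler below is a stand-in for the trained decoder-only network with pairwise masking:

```python
import numpy as np

V = 8                                   # toy unit vocabulary per channel
rng = np.random.default_rng(11)

def sample_pair(history):
    """Emit one (speaker A, speaker B) token pair. A trained NTPP model
    would condition on the interleaved history; uniform sampling is a
    placeholder for both conditionals."""
    a_t = int(rng.integers(V))
    b_t = int(rng.integers(V))          # in NTPP, b_t may additionally see a_t
    return a_t, b_t

history = []
for _ in range(6):                      # generate six synchronized token pairs
    history.append(sample_pair(history))

a_stream = [a for a, _ in history]
b_stream = [b for _, b in history]
print(a_stream, b_stream)
```

Generating both channels in lockstep is what lets overlaps and backchannels emerge without an explicit turn-taking controller.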

Prompt-Tuned and Long-Form Architectures

  • Prompt-tuning frameworks enable efficient adaptation of large speech LMs to generative tasks using continuous soft prompt vectors injected at all transformer layers, optimizing only a small fraction of parameters (Wu et al., 2023).
  • State-space models (SSM, Griffin) scale spoken language generation to hour-long continuations in bounded memory, supporting long-form coherent audio generation (Park et al., 2024).

Joint Linguistic–Acoustic Models

  • Flow-SLM jointly generates discrete semantic tokens and real-valued acoustic representations at each timestep via a conditional flow-matching objective. Multi-step semantic-token prediction is critical for decoupling acoustic smoothness from linguistic content, ensuring preservation of both linguistics and prosody (Chou et al., 12 Aug 2025).

4. Training Objectives, Losses, and Inference

  • Discrete autoregressive models minimize cross-entropy LM loss over token sequences.
  • Prosody- and paralinguistic-aware models use joint, weighted cross-entropy or combine cross-entropy for content and regression (L1/L2, KL, or Laplacian) for prosodic/continuous attributes.
  • Variational models maximize the ELBO, balancing reconstruction and KL divergences for both discrete and continuous latents (Chen et al., 17 Jun 2025).
  • Contrastive losses (InfoNCE) over continuous lexical embeddings enforce discrimination using positive and negative keys for tokens to be predicted (Algayres et al., 2023).
  • During inference, generative sampling is performed autoregressively (token-by-token or token-pair-by-token-pair for dialogue), sometimes augmented with k-NN search in embedding space (Algayres et al., 2023).
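The InfoNCE objective over continuous token embeddings can be written compactly: each prediction should score higher against its own target key than against the in-batch negatives. The predictions and keys below are synthetic; temperature and batch size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
D, N = 16, 8
pred = rng.normal(size=(N, D))               # model predictions for N positions
keys = pred + 0.1 * rng.normal(size=(N, D))  # positives: perturbed targets

def info_nce(pred, keys, temp=0.1):
    """Cross-entropy where the i-th key is the positive for the i-th
    prediction and all other keys act as negatives."""
    sim = pred @ keys.T / temp
    sim -= sim.max(axis=1, keepdims=True)    # numerical stability
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))

loss = info_nce(pred, keys)
print(loss)
```

The same loss shape covers both discrete tokens (via embedding lookup) and continuous lexical embeddings, which is why it pairs naturally with the k-NN inference described above.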

5. Evaluation Protocols and Metrics

Evaluation spans acoustic, linguistic, prosodic, paralinguistic, and generative dimensions:

| Aspect | Standard Metric(s) | Key Purpose |
| --- | --- | --- |
| Acoustic | ABX (phoneme discrimination) | Unit purity, phonetic coverage |
| Linguistic | sWUGGY, sBLIMP, VERT, PPX | Lexical/grammatical modeling |
| Prosodic | F₀-RMSE, pitch/duration MAE | Prosody consistency, diversity |
| Paralinguistic | Sentiment (accuracy), MOS, BLEU | Attribute identification/generation |
| Generation | CER, WER, ASR-PPL, MOS, auto-BLEU | Realism, intelligibility, diversity |
| Turn-taking | IPU, gap, overlap event stats | Dialogue naturalness/coherence |

Recent work critically re-assesses global token perplexity, showing its misalignment with human-rated quality for speech, and proposes windowed, normalized perplexity, embedding-based “judge” metrics, and stratified MOS to better reflect perceived generation quality, especially for local acoustic plausibility (Sju et al., 9 Jan 2026). Embedding-based reference metrics (SBERT, Gecko) and LLM-judge side-by-side evaluations are used for long-form generation (Park et al., 2024).
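A windowed perplexity of the kind argued for here can be sketched as follows: per-token log-probabilities are aggregated over sliding windows, yielding a local plausibility profile rather than one global number. The exact windowing and normalization in the cited work may differ, and the log-probabilities below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(9)
logp = -rng.exponential(scale=1.0, size=200)   # stand-in per-token log-probs

def windowed_ppl(logp, win=20, hop=10):
    """Perplexity of each sliding window: exp of the negative mean
    log-probability over the window."""
    ppls = []
    for start in range(0, len(logp) - win + 1, hop):
        w = logp[start:start + win]
        ppls.append(np.exp(-w.mean()))
    return np.array(ppls)

ppls = windowed_ppl(logp)
print(ppls.min(), ppls.max())  # local perplexity profile along the sequence
```

A spike in this profile localizes an implausible stretch of audio that a single sequence-level perplexity would average away.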

6. Empirical Results and Model Comparisons

  • Discrete SLMs (GSLM, pGSLM) achieve ABX error rates in the 5–7% range, with MMOS of 4.0–4.3 and meaningfulness MOS competitive with text+TTS toplines (Lakhotia et al., 2021, Kharitonov et al., 2021).
  • Redundancy-aware and augmentation-invariant unit discovery reduces ABX by 10–30% relative to naive k-means and improves speech-to-speech translation BLEU by 2–3 points (Sicherman et al., 2023, Gat et al., 2022).
  • Variational and flow-based models (Flow-SLM, VAE-GSLM) yield comparable sWUGGY/sBLIMP lexical modeling but significantly improve acoustic consistency (speaker-similarity rises from 0.09 to 0.4), F₀-RMSE drops from 43.9 to 16.6 Hz, and naturalness-MOS increases by up to 0.4 (Chou et al., 12 Aug 2025, Chen et al., 17 Jun 2025).
  • Paralinguistics-aware dialogue generation (ParalinGPT) boosts sentiment accuracy by +6.7%/12% over text-only baselines and increases BLEU by 3.5% (Lin et al., 2023).
  • Dual-channel dialogue SLMs (dGSLM, NTPP) achieve near-human MOS for turn-taking and naturalness (NTPP: 3.95–4.15; ground-truth: 4.90), with NTPP providing sub-200 ms inference latency, outperforming earlier cross-attention and encoder–decoder models (Wang et al., 1 Jun 2025, Nguyen et al., 2022).
  • Long-form generation (SpeechSSM) maintains coherent speech for up to 16 minutes, matching human reference in ASR-PPL and MOS time-stratified naturalness (Park et al., 2024).

7. Open Challenges and Future Directions

  • Unit representation: Ongoing refinement of unit discovery (robustness to signal transformations, redundancy removal, scaling to multilingual) remains critical (Gat et al., 2022, Sicherman et al., 2023).
  • Prosody and paralinguistics: Richer unsupervised modeling of emotion, style, and speaker traits is a major objective. End-to-end continuous latent variables and instruction-tuned LLMs are promising (Chen et al., 17 Jun 2025, Lin et al., 2023).
  • Dialogue and context: Modeling speaker turn-taking (pauses, overlaps, and floor transfers) requires large dual-channel datasets, both pseudo-stereo and real dialogue. Improving semantic coherence beyond event statistics remains an open challenge (Fu et al., 2024, Wang et al., 1 Jun 2025).
  • Long-form speech: Unbounded and resource-efficient long-form speech generation with global semantic planning, hierarchical state-space, and minimal drift is an emerging research focus (Park et al., 2024).
  • Evaluation: Adopting embedding-driven, local, and normalized perplexity metrics is necessary to accurately track model progress, especially as models approach human-level MOS and acoustic variability (Sju et al., 9 Jan 2026).
  • Biological and cross-species modeling: Extensions beyond human speech, as in marmoset vocalization modeling, demonstrate generality and facilitate neuroscientific investigation of vocal communication (Sternberg et al., 11 Sep 2025).
  • Unified and scalable pipelines: Integrating speech and text representations, supporting multilingual and low-resource settings, and reducing reliance on text for model selection and evaluation remain central goals.

GSLM forms the basis for "foundation models" in spoken language, enabling speech-native dialogue agents, robust spoken translation, and language processing in data regimes where orthographic resources are unavailable or insufficient.
