Textless Spoken Language Modeling
- Textless spoken language modeling is a framework that learns the statistical structure of spoken language directly from raw audio using unsupervised unit discovery.
- It employs discrete bottlenecks and transformer-based language models to capture lexical, syntactic, and prosodic features without relying on orthographic text.
- This approach enables robust applications in zero-resource languages, dialogue systems, prosody transfer, and speech-to-speech translation while addressing challenges in noise robustness and semantic depth.
Textless spoken language modeling (SLM) denotes a class of generative and predictive models that learn the statistical structure of spoken language directly from raw audio, in the absence of, or without reliance on, orthographic text. Unlike conventional automatic speech recognition (ASR) or speech-to-text-to-speech pipelines, textless SLMs discover a sequence of learned acoustico-linguistic units, train language models over these units, and often map back to speech with a generative decoder. This approach enables both language processing and generation in truly zero-resource conditions, including unwritten or low-resource languages, and offers a laboratory for testing cognitive, neuroscientific, and engineering hypotheses about language acquisition and spoken interaction.
1. Core Principles: Discrete Units, Bottlenecking, and Model Structure
A central tenet of textless SLM is the imposition of a time-discretized, finite-symbol (“unit”) bottleneck between raw audio and language modeling, defined as follows:
- Unit Discovery: An acoustic encoder (e.g., HuBERT, CPC, wav2vec 2.0, SpidR) transforms raw audio into frame-wise high-dimensional features h_t. Unsupervised quantization (typically k-means, vector quantization, or self-distillation with online clustering) maps each h_t to a unit index z_t ∈ {1, …, K}. This “pseudo-text” forms the vocabulary for downstream modeling (Lakhotia et al., 2021, Poli et al., 23 Dec 2025).
- Discrete Bottleneck: Discretization suppresses non-linguistic variation (speaker, channel, background noise), removing information that is detrimental to robust language modeling and yielding gains of 20+ points on lexical metrics and up to 6 points on syntactic metrics relative to continuous representations (Nguyen et al., 2022). ABX and phone-normalized mutual information (PNMI) have been validated as high-correlation proxies for downstream SLM performance (Poli et al., 23 Dec 2025).
- Unit-based Language Modeling: Transformer LMs (causal/decoder-only or masked/encoder-style) are trained to predict the next (or masked) unit with cross-entropy or masked-prediction losses (Lakhotia et al., 2021, Nguyen et al., 2022). For dialogue or sequence-to-sequence settings, multi-stream architectures and cross-attention mechanisms are introduced (Nguyen et al., 2022, Mai et al., 8 Jan 2025). A minimal sketch of the full pipeline follows this list.
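The following is a minimal, self-contained sketch of the three steps above, using random arrays in place of real HuBERT/CPC features; the vocabulary size, model dimensions, and sequence length are illustrative toy values rather than settings from the cited papers.

```python
# Toy end-to-end sketch: k-means unit discovery -> run-length deduplication
# -> next-unit cross-entropy training of a small causal Transformer.
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

K = 100  # codebook size (number of discrete units)

# 1) Unit discovery: quantize frame-wise encoder features with k-means.
#    Random features stand in for HuBERT/CPC/wav2vec 2.0 outputs here.
features = np.random.randn(5000, 768).astype(np.float32)
kmeans = KMeans(n_clusters=K, n_init=4, random_state=0).fit(features)
units = kmeans.predict(features)  # frame-wise unit indices z_t in {0..K-1}

# 2) Run-length deduplication: collapse repeated frames into one symbol,
#    the usual preprocessing step before unit language modeling.
dedup = units[np.insert(np.diff(units) != 0, 0, True)]

# 3) Unit-based LM: a causal Transformer trained with next-unit
#    cross-entropy (positional encodings omitted for brevity).
class UnitLM(nn.Module):
    def __init__(self, vocab=K, dim=256, heads=4, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):  # x: (batch, time) unit ids
        T = x.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.backbone(self.embed(x), mask=causal)
        return self.head(h)

lm = UnitLM()
seq = torch.tensor(dedup[:512], dtype=torch.long).unsqueeze(0)
logits = lm(seq[:, :-1])  # predict unit t+1 from units 1..t
loss = nn.functional.cross_entropy(logits.reshape(-1, K),
                                   seq[:, 1:].reshape(-1))
```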
Recent models—e.g., Flow-SLM (Chou et al., 12 Aug 2025)—extend this by jointly modeling both discrete semantic tokens and continuous acoustic vectors, thus learning “what to say” and “how to say it” in an integrated framework.
2. Modeling Architectures and Advancements
Textless SLM architectures have evolved from modular, cascaded pipelines to highly integrated systems that unify prosody, style, dialogue, and translation. Representative architectures include:
- Classic GSLM Pipeline: (i) Self-supervised encoder (e.g. HuBERT), (ii) discretizer (e.g. k-means), (iii) Transformer LM, (iv) unit-to-speech decoder (Tacotron2, HiFi-GAN) (Lakhotia et al., 2021, Kharitonov et al., 2022). Modular design allows for independent experimentation with each component.
- Multi-stream and Prosody-Aware Models: Models like pGSLM jointly autoregress over units, durations, and (quantized) log-pitch, achieving both prosodic coherence and content fidelity (Kharitonov et al., 2021); a simplified multi-stream sketch follows this list.
- Real-Time and Dialogue Generation: Streaming models such as RTTL-DG process multi-speaker dialogues with learned discrete units, fusing turn-taking, paralinguistics (laughter, fillers), and response generation into a low-latency end-to-end system (Mai et al., 8 Jan 2025, Lu et al., 1 Jan 2025).
- Multitask and Translation Systems: Multitask LMs like MSLM-S2ST natively support speech-to-speech translation (S2ST), integrating semantic-unit translation, acoustic-unit generation, and speaker style preservation, conditioned solely on unit streams with no text supervision (Peng et al., 19 Mar 2024). “Unit language” representations built by n-gram segmentation enable further cross-lingual and cross-modal learning (Zhang et al., 21 May 2025).
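To make the multi-stream idea concrete, here is a simplified sketch in the spirit of pGSLM: one shared causal backbone with separate heads for the next unit, its duration, and its quantized log-pitch. The fusion-by-summation scheme and all sizes are illustrative assumptions, not the published architecture.

```python
# Simplified multi-stream LM: shared backbone, three prediction heads.
import torch
import torch.nn as nn

class MultiStreamLM(nn.Module):
    def __init__(self, n_units=100, n_pitch_bins=32, dim=256):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, dim)
        self.pitch_emb = nn.Embedding(n_pitch_bins, dim)
        self.dur_proj = nn.Linear(1, dim)
        block = nn.TransformerEncoderLayer(dim, 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, 4)
        self.unit_head = nn.Linear(dim, n_units)        # what to say
        self.dur_head = nn.Linear(dim, 1)               # how long to hold it
        self.pitch_head = nn.Linear(dim, n_pitch_bins)  # how to intone it

    def forward(self, units, durations, pitch):
        # Sum the three input streams into one sequence representation.
        x = (self.unit_emb(units)
             + self.dur_proj(durations.unsqueeze(-1).float())
             + self.pitch_emb(pitch))
        T = x.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.backbone(x, mask=causal)
        return self.unit_head(h), self.dur_head(h), self.pitch_head(h)
```

Training would combine cross-entropy losses on the unit and pitch-bin streams with a regression loss (e.g., L1) on durations.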
Advanced frameworks couple these architectures with novel training objectives, such as Direct Preference Optimization (DPO) driven by LLM-based feedback to improve semantic alignment (Lin et al., 4 Nov 2024). Flow-matching objectives allow for continuous acoustic generation conditioned on semantic structure (Chou et al., 12 Aug 2025).
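As a concrete illustration of preference optimization over unit sequences, the sketch below implements the standard DPO loss; the pairing of preferred/dispreferred continuations (e.g., ranked by an LLM judge on their transcripts) and the value of beta are illustrative assumptions.

```python
# Standard DPO loss applied to summed sequence log-probabilities of a
# preferred (w) and dispreferred (l) unit continuation, computed under
# the trained policy and a frozen reference model.
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```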
3. Evaluation Methodologies and Empirical Results
Benchmarking and evaluation in textless SLM target several axes:
- Acoustic Discriminability (ABX): Phone-discrimination error measured on triphone minimal pairs, both within and across speakers.
- Lexical (sWUGGY), Syntactic (sBLIMP), Semantic (tSC, sSIMI) Measures: Pairs of real vs. non-words, grammatical vs. ungrammatical sentence pairs, and story-ending coherence are scored using log-likelihoods from the SLM (Dunbar et al., 2021, Lin et al., 4 Nov 2024, Poli et al., 23 Dec 2025); a scoring sketch follows this list.
- Resynthesis (ASR-BLEU, WER, MOS): Generated speech is transcribed by a downstream ASR and compared to ground truth. Subjective human naturalness (N-MOS), meaningfulness (M-MOS), and prosodic quality (P-MOS) are also reported (Lakhotia et al., 2021, Kharitonov et al., 2021).
- Paralinguistic and Dialogue Behavior: Analysis of pauses, gaps, overlaps, fillers, laughter, and turn-taking, using VAD segmentation and dialog act labeling (Nguyen et al., 2022, Mai et al., 8 Jan 2025).
- Zero-Shot and Few-Shot Generalization: In-context learning capability for novel classification tasks, with significant gains upon warmup and prompt training (Hsu et al., 2023).
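The zero-shot lexical and syntactic metrics reduce to log-likelihood comparisons, as in the sketch below; `lm` is assumed to be a causal unit LM like the one sketched in Section 1, and the paired unit sequences are placeholders.

```python
# sWUGGY/sBLIMP-style scoring: the SLM "wins" a pair when it assigns a
# higher total log-likelihood to the real word (or grammatical sentence).
import torch

@torch.no_grad()
def sequence_logprob(lm, units):  # units: (time,) tensor of unit ids
    logits = lm(units[:-1].unsqueeze(0))
    logp = torch.log_softmax(logits, dim=-1)
    return logp[0, torch.arange(len(units) - 1), units[1:]].sum()

def paired_accuracy(lm, pairs):  # pairs: list of (real_units, fake_units)
    wins = sum(bool(sequence_logprob(lm, real) > sequence_logprob(lm, fake))
               for real, fake in pairs)
    return wins / len(pairs)
```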
Empirically, SpidR units lead current benchmarks, e.g., sWUGGY = 71.89%, sBLIMP = 59.48% (zero-shot) (Poli et al., 23 Dec 2025). Preference-optimized Align-SLM achieves sWUGGY = 77.9%, sBLIMP = 62.3%, and high semantic StoryCloze accuracy (T-StoryC 86.8%) (Lin et al., 4 Nov 2024). Human ratings of prosody and content quality in pGSLM often approach those of “oracle” resynthesis (Kharitonov et al., 2021).
4. Specialized Subfields: Prosody, Dialogue, and Translation
Dedicated subdomains have emerged:
- Prosody Modeling: Multi-stream LMs parameterize pitch and duration jointly with units, enabling expressive synthesis and prosodic transfer (Kharitonov et al., 2021, Chou et al., 12 Aug 2025). Flow-SLM models continuous acoustic features directly and outperforms previous LMs on acoustic-and-prosody benchmarks, albeit with a small tradeoff in linguistic perplexity (Chou et al., 12 Aug 2025).
- Textless Dialogue Systems: Models such as dGSLM and SLIDE learn to represent and generate spoken dialogue, backchannels, and laughter, with turn-taking regulated through hybrid architectures (dual-tower transformers, two-tower duration predictors) (Nguyen et al., 2022, Lu et al., 1 Jan 2025, Mai et al., 8 Jan 2025).
- Speech-to-Speech Translation (S2ST): Both cascaded (encoder-decoder) and fully end-to-end LMs have achieved S2ST by modeling semantic and acoustic units, with competitive BLEU and high speaker-style preservation. Explicit task prompts and auxiliary “unit language” representations enhance cross-lingual generalization and mitigate destructive multitask interference (Zhang et al., 21 May 2025, Peng et al., 19 Mar 2024).
Recent work introduces unsupervised “unit language” as a hierarchical, text-like unit-gram abstraction for cross-modal and cross-lingual supervision, achieving BLEU parity between unit-language and true-text models on VoxPopuli S2ST (Zhang et al., 21 May 2025).
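A generic BPE-style loop over unit sequences illustrates how such word-like “unit language” tokens can be induced; this is a textbook merge procedure under assumed inputs, not the exact segmentation algorithm of the cited work.

```python
# Induce "unit words" by repeatedly merging the most frequent adjacent
# pair of symbols, starting from single discrete units.
from collections import Counter

def learn_unit_bpe(sequences, num_merges=500):
    seqs = [[(u,) for u in s] for s in sequences]  # symbols = unit tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter((a, b) for s in seqs for a, b in zip(s, s[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        merged = a + b  # concatenate the underlying unit tuples
        for s in seqs:
            i, out = 0, []
            while i < len(s):
                if i + 1 < len(s) and s[i] == a and s[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(s[i])
                    i += 1
            s[:] = out
    return merges  # ordered merge rules defining the unit-language vocabulary
```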
5. Limitations, Tradeoffs, and Open Research Problems
Current limitations and outstanding challenges include:
- Semantic Coverage Gaps: Even large SLMs underperform text-based LLMs in long-range semantic coherence and content fidelity. DPO and LLM-guided preference rewards substantially close this gap, but do not yet address paralinguistics and prosody (Lin et al., 4 Nov 2024).
- Noise Sensitivity: SLMs are vulnerable to additive and especially babble noise, with significant degradations in unit, phone, and word error rates (UER, PER, WER) under such conditions (Park et al., 2023); a minimal UER sketch follows this list. End-to-end VQ training and robust feature learning are active research areas.
- Granularity and Temporal Scale: Frame-based discretization can be suboptimal for semantics; segmentation into unit “words” (via n-gram statistics or BPE) partially alleviates this, but full alignment with lexical or morpho-syntactic units remains unresolved (Zhang et al., 21 May 2025).
- Style, Paralinguistics, and Multilinguality: Explicit modeling of speaker style, prosody, laughter, and dialogue dynamics is nascent. Some models reach strong naturalness (N-MOS) but lag in meaningfulness (M-MOS), pointing to a disconnect between surface fluidity and content encoding (Nguyen et al., 2022, Mai et al., 8 Jan 2025).
- Resource Inefficiency and Generalization: Many SLM training pipelines require thousands of GPU-hours and massive datasets. SpidR reduces pretraining from days to hours and makes rapid benchmarking feasible (Poli et al., 23 Dec 2025).
- Evaluation Standardization: The field has coalesced around the ZeroSpeech 2021 metrics and derived benchmarks, but semantics and higher-level reasoning are still challenging to evaluate automatically.
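To make the noise-robustness measurement concrete, here is a minimal unit-error-rate (UER) sketch: quantize a clean and a noise-augmented version of the same utterance, then take the length-normalized edit distance between the two unit sequences. The normalization choice is an assumption; Park et al. (2023) should be consulted for the exact protocol.

```python
# Length-normalized edit distance between clean and noisy unit sequences.
import numpy as np

def edit_distance(a, b):
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i, j] = min(d[i - 1, j] + 1,   # deletion
                          d[i, j - 1] + 1,   # insertion
                          d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))  # sub
    return d[-1, -1]

def unit_error_rate(clean_units, noisy_units):
    return edit_distance(clean_units, noisy_units) / max(len(clean_units), 1)
```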
6. Practical Implementation and Ecosystem
The textless-lib open-source library provides standard, modular pipelines for extraction (e.g. HuBERT, VQ), unit modeling, resynthesis, and evaluation; models can be constructed and trained with <50 lines of code (Kharitonov et al., 2022). For rapid benchmarking, SpidR offers PyTorch-based, efficient, and robust code with validated unit-selection heuristics and streamlined distributed pretraining (Poli et al., 23 Dec 2025). HiFi-GAN and EnCodec are commonly used for high-fidelity unit-to-waveform decoding (Peng et al., 19 Mar 2024, Kharitonov et al., 2021).
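For orientation, a quick-start along the lines of the textless-lib documentation looks roughly as follows; the argument names are recalled from the project README and may differ across library versions.

```python
# Extract deduplicated HuBERT units from a waveform with textless-lib
# (names per the project README; verify against your installed version).
import torchaudio
from textless.data.speech_encoder import SpeechEncoder

encoder = SpeechEncoder.by_name(
    dense_model_name="hubert-base-ls960",
    quantizer_model_name="kmeans",
    vocab_size=100,
    deduplicate=True,
).cuda()

waveform, sample_rate = torchaudio.load("input.wav")
units = encoder(waveform.cuda())["units"]  # discrete pseudo-text sequence
```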
Convenient recipes and code enable speaker probing, compression, and textless continuation. Fine-tuning, adaptive codebook design, and feature-level multilingual transfer are recommended when moving to new domains and languages.
7. Outlook and Future Directions
Research frontiers in textless spoken language modeling include:
- Scaling and Hybridization: Scaling model and data size, and integrating auxiliary text/textless streams, to further improve lexical/syntactic/semantic coverage (Chou et al., 12 Aug 2025).
- Zero-Text and Multilingual Transfer: Extending unit discovery and LM adaptation to unwritten and highly under-resourced languages, potentially via meta-learning and transfer from high-resource conditions (Zhang et al., 21 May 2025, Poli et al., 23 Dec 2025).
- Paralinguistics and Style: Deeper integration of prosody, emotion, and style via explicit losses, multi-stream modeling, and feedback mechanisms (Kharitonov et al., 2021, Peng et al., 19 Mar 2024).
- Preference and Feedback Optimization: Systematic application of LLM-based, human-in-the-loop, or curriculum preference learning to align SLM generations with human- or task-centric objectives (Lin et al., 4 Nov 2024).
- Robustness and Denoising: Robust unit learning under noise, adversarial, and multilingual conditions via improved architectures and training procedures (Park et al., 2023).
- Rich Conversational Capabilities: Next-generation dialogue models combining the low-latency, paralinguistic fluency of RTTL-DG with the enhanced coherence and knowledge grounding of LLMs (Mai et al., 8 Jan 2025, Lu et al., 1 Jan 2025).
Textless SLM represents a rapidly maturing paradigm, with strong empirical advances across lexical, syntactic, prosodic, and translation tasks, but open challenges in semantic depth, robustness, style control, and generalization outside conventional languages and datasets. The ongoing convergence of large-scale SLMs, LLM integration, and unsupervised unit discovery is expected to accelerate progress toward robust, universal spoken-language modeling systems.