Textless Dialogue Generation

Updated 4 December 2025
  • Textless dialogue generation is a paradigm that bypasses conventional text representations, operating directly on raw audio waveforms through discrete acoustic units learned via self-supervision.
  • Models like dGSLM and RTTL-DG leverage dual-tower and streaming architectures to reduce latency and capture paralinguistic cues such as laughter, back-channels, and overlapping speech.
  • Despite improved turn-taking and natural rhythm, these systems currently lag in semantic coherence compared to traditional ASR-NLG-TTS pipelines, highlighting areas for further research.

Textless dialogue generation refers to a paradigm in spoken dialogue systems where all stages—including input encoding, interaction modeling, and output synthesis—operate directly on raw audio waveforms or learned acoustic units, entirely bypassing canonical text representations. Unlike conventional automated dialogue systems that rely on automatic speech recognition (ASR), natural language generation (NLG), and text-to-speech (TTS) as sequential, cascaded modules, textless systems discover and process their own discrete, speech-derived unit vocabularies to achieve naturalistic, human-like conversational flow, encompassing both verbal and paralinguistic phenomena such as laughter, filled pauses, and overlapping speech (Nguyen et al., 2022, Mai et al., 8 Jan 2025).

1. Motivation and Problem Framing

Traditional spoken dialogue architectures are dominated by cascaded pipelines in which input speech is transcribed to text (ASR), language is modeled in text (LM/NLG), and output is rendered to speech via TTS. This cascade introduces several limitations:

  • Latency and Rigidity: Sequential processing engenders high response delays and rigid turn boundaries.
  • Information Loss: Rich paralinguistic cues (e.g., laughter, breath, spontaneous hesitations) are discarded during ASR and not recoverable by downstream modules.
  • Poor Overlap Modeling: Text representations inadequately capture true conversational overlap, back-channels, and seamless floor-switches.

Textless dialogue generation is designed to rectify these deficits by dispensing with explicit text as an intermediate representation. It achieves this by operating directly on sequences of speech-derived discrete units, ensuring that non-verbal and paralinguistic features are directly modeled and synthesized (Nguyen et al., 2022, Mai et al., 8 Jan 2025).

2. Discrete Speech Unit Discovery and Representation

Core to textless dialogue systems is the unsupervised discovery of a discrete speech unit vocabulary:

  • Acoustic Feature Extraction: Self-supervised models such as HuBERT (7-layer strided CNN frontend, multiple Transformer layers) are trained to produce frame-level embeddings from raw waveform, learned via prediction of clustered acoustic features.
  • Vector Quantization: These embeddings are quantized into discrete clusters (e.g., K=500 (Nguyen et al., 2022), K≈10,000 (Mai et al., 8 Jan 2025)) via k-means or similar algorithms:

$$\min_{\{\mu_k\},\,c(\cdot)}\ \sum_{t=1}^{T}\bigl\|h_t - \mu_{c(t)}\bigr\|^2, \qquad c(t)=\arg\min_{k}\bigl\|h_t-\mu_k\bigr\|^2$$

  • Paralinguistic Coverage: The resultant codebooks encode not only phone-like segments but also communicative and expressive events—laughter, filled pauses, back-channels—without explicit supervision (Nguyen et al., 2022, Mai et al., 8 Jan 2025).

This yields unit streams per speaker, which serve as the substrate for all downstream modeling and generation.
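
As an illustration of this unit-discovery step, the Python sketch below extracts frame-level HuBERT embeddings and fits a k-means codebook, then assigns each frame its nearest centroid as in the formula above. It assumes the HuggingFace `transformers` and `scikit-learn` packages; the checkpoint name, layer choice, placeholder file paths, and K = 500 (the dGSLM setting) are illustrative, and the exact pipeline used by the cited systems may differ.

```python
# Sketch: discretize speech into HuBERT-derived units via k-means.
import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans
from transformers import HubertModel

model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def hubert_features(wav_path: str, layer: int = 6) -> torch.Tensor:
    """Return frame-level embeddings h_t (one 768-d vector per ~20 ms frame)."""
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0, keepdim=True)
    with torch.no_grad():
        out = model(wav, output_hidden_states=True)
    return out.hidden_states[layer].squeeze(0)          # (T, 768)

# Fit the codebook {mu_k} on pooled training frames (K = 500 as in dGSLM).
train_files = ["speaker_a.wav", "speaker_b.wav"]        # placeholder paths
frames = torch.cat([hubert_features(p) for p in train_files]).numpy()
kmeans = MiniBatchKMeans(n_clusters=500, random_state=0).fit(frames)

# c(t) = argmin_k ||h_t - mu_k||^2 gives the discrete unit stream per file.
units = kmeans.predict(hubert_features("speaker_a.wav").numpy())
print(units[:20])
```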

3. Model Architectures and Dialogue Control

dGSLM ("Dual-Tower Generative Spoken Dialogue LLM")

  • Dual-Tower Transformer: Two mirrored towers, one per speaker, each with 6 layers, 8-head self-attention, cross-attention in the top 4 layers, and standard Pre-LN normalization.
  • Cross-Attention: Each tower’s internal representation is updated via:

$$\mathsf{CA}\bigl(H^{(c),\ell}_{\text{self}},\, H^{(\bar c),\ell}_{\text{self}}\bigr) = \mathrm{softmax}\!\bigl(QK^\top/\sqrt{d}\bigr)\,V$$

  • Speaker Agnosticity: Parameters and the unit codebook are shared across both towers.
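
A minimal PyTorch sketch of one such dual-tower layer follows. The hidden size, norm/residual placement, and module structure are illustrative assumptions; the real model's causal masking and feed-forward sublayers are omitted for brevity.

```python
import torch
import torch.nn as nn

class DualTowerLayer(nn.Module):
    """Shared-weight layer: self-attention within a channel, then
    cross-attention over the other channel's states."""
    def __init__(self, d: int = 512, n_heads: int = 8):
        super().__init__()
        self.norm_self = nn.LayerNorm(d)
        self.norm_cross = nn.LayerNorm(d)
        self.self_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def _tower(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # Pre-LN self-attention with a residual connection.
        q = self.norm_self(x)
        x = x + self.self_attn(q, q, q)[0]
        # CA(H_self^(c), H_self^(c_bar)) = softmax(Q K^T / sqrt(d)) V
        kv = self.norm_cross(other)
        x = x + self.cross_attn(self.norm_cross(x), kv, kv)[0]
        return x

    def forward(self, h_a: torch.Tensor, h_b: torch.Tensor):
        # The same parameters process both towers => speaker-agnostic.
        return self._tower(h_a, h_b), self._tower(h_b, h_a)

layer = DualTowerLayer()
h_a, h_b = torch.randn(1, 100, 512), torch.randn(1, 100, 512)
out_a, out_b = layer(h_a, h_b)   # each channel now attends to the other speaker
```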

RTTL-DG (Real-Time Textless Dialogue Generation)

  • Streaming Encoder: 8-layer causal Transformer with state merges post layers 2/4/6, operating on 20 ms waveform slices, reducing to 160 ms contextual embeddings with D=768.
  • Context Tracker: Sliding-window based; at 160 ms intervals it predicts one of four action tokens: Remain Silent, Initiate Speaking, Continue Speaking, or Stop Speaking.
  • Autoregressive Decoder: 8-layer Transformer (D=1536, 16 heads) outputs discrete units when speaking.
  • Paralinguistic Module: Auxiliary heads optionally extract paralinguistic features (pitch, energy, back-channel probability) from the speech units; these are summed into the encoder state.

| Model   | Key Features                | Codebook Size |
|---------|-----------------------------|---------------|
| dGSLM   | Dual-tower, cross-attention | 500           |
| RTTL-DG | Streaming, unified decoder  | ≈10,000       |
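
To make the streaming control flow concrete, the toy Python loop below mimics an RTTL-DG-style step: for every 160 ms block the context tracker selects one of the four action tokens, and units are decoded only while speaking. The `encoder`, `tracker`, and `unit_head` modules are simple stand-ins, not the published architecture.

```python
import torch
import torch.nn as nn

ACTIONS = ["REMAIN_SILENT", "INITIATE", "CONTINUE", "STOP"]
D = 768

encoder = nn.GRU(D, D, batch_first=True)   # stand-in for the causal streaming Transformer
tracker = nn.Linear(D, len(ACTIONS))       # context tracker: next-action head
unit_head = nn.Linear(D, 10_000)           # unit head over a ~10k-unit codebook

def stream_step(block_feats: torch.Tensor, state):
    """block_feats: (1, n_frames, D) acoustic features covering one 160 ms block."""
    out, state = encoder(block_feats, state)
    ctx = out[:, -1]                                   # latest contextual embedding
    action = ACTIONS[tracker(ctx).argmax(dim=-1).item()]
    units = None
    if action in ("INITIATE", "CONTINUE"):
        units = unit_head(ctx).argmax(dim=-1)          # greedy single-step decode (toy)
        # in practice: autoregressive unit decoding, then vocoder synthesis
    return action, units, state

state = None
for t in range(5):                                     # five 160 ms blocks of a toy stream
    action, units, state = stream_step(torch.randn(1, 8, D), state)
    print(t, action, None if units is None else units.item())
```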

4. Training Objectives and Inference

Both paradigms employ multi-objective cross-entropy losses:

  • Action Prediction (RTTL-DG): The context tracker's next-action distribution is supervised with:

$$\mathcal{L}_{\text{action}} = -\sum_{k=1}^{n}\sum_{i=1}^{4} y_{k,i}^{a}\,\log p(a_k = i \mid H)$$

  • Speech Unit Modeling: Conditional cross-entropy loss over generated unit sequences.

$$\mathcal{L}_{\text{response}} = -\sum_{k:\,a_k=\text{SPK}}\ \sum_{t=1}^{L}\ \sum_{v\in V} y_{k,t,v}^{u}\,\log p\bigl(u_{k,t}=v \mid u_{k,<t}, H\bigr)$$

  • Duration Modeling: (dGSLM) Separate L1 loss for predicting frame durations per unit.
  • Combined Loss: Weighted aggregation of the above terms (e.g., $\lambda_a = 1$, $\lambda_u = 1$) with optional auxiliary losses (e.g., back-channel classification), as sketched below.
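
A minimal PyTorch sketch of this combined objective, with random logits and targets standing in for model outputs; the assumption that action index 0 denotes "Remain Silent", and the shapes used, are illustrative.

```python
import torch
import torch.nn.functional as F

n_steps, n_actions, vocab, seq_len = 16, 4, 500, 32

action_logits = torch.randn(n_steps, n_actions)            # unnormalized p(a_k | H)
action_targets = torch.randint(0, n_actions, (n_steps,))   # y^a: gold next actions
unit_logits = torch.randn(n_steps, seq_len, vocab)         # p(u_{k,t} | u_{k,<t}, H)
unit_targets = torch.randint(0, vocab, (n_steps, seq_len)) # y^u: gold unit ids
speaking = action_targets != 0                             # assume index 0 = "Remain Silent"

loss_action = F.cross_entropy(action_logits, action_targets)
loss_response = F.cross_entropy(
    unit_logits[speaking].reshape(-1, vocab),              # only steps where the model speaks
    unit_targets[speaking].reshape(-1),
)
loss = 1.0 * loss_action + 1.0 * loss_response             # lambda_a = lambda_u = 1
print(float(loss))
```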

Inference proceeds autoregressively, generating action and unit tokens at each step, with unit streams subsequently rendered to audio via neural vocoders such as HiFi-GAN (Nguyen et al., 2022, Mai et al., 8 Jan 2025).
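
The sketch below illustrates the autoregressive unit-generation step with a toy stand-in model (embedding + GRU + linear head); the final unit-to-waveform call is left as a commented placeholder, since the interface of the unit-conditioned HiFi-GAN vocoder depends on the specific implementation used.

```python
import torch
import torch.nn as nn

K = 500                                   # unit vocabulary size
emb = nn.Embedding(K, 64)                 # toy stand-in for the trained dialogue model
rnn = nn.GRU(64, 64, batch_first=True)
head = nn.Linear(64, K)

@torch.no_grad()
def generate_units(prompt_units: torch.Tensor, max_new: int = 50,
                   temperature: float = 1.0) -> torch.Tensor:
    units = prompt_units.clone()          # (1, T0) discrete unit ids from the prompt
    for _ in range(max_new):
        h, _ = rnn(emb(units))
        logits = head(h[:, -1]) / temperature
        next_unit = torch.multinomial(logits.softmax(dim=-1), num_samples=1)
        units = torch.cat([units, next_unit], dim=-1)
    return units

units = generate_units(torch.randint(0, K, (1, 10)))
# waveform = unit_hifigan(units)          # placeholder: a unit-conditioned HiFi-GAN vocoder
print(units.shape)                        # torch.Size([1, 60])
```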

5. Evaluation Metrics and Performance

Evaluation encompasses conventional naturalness and semantic coherence, alongside metrics capturing dialogic and paralinguistic fidelity:

  • Turn-taking Statistics: Rates of overlaps, pauses, floor-transfer offset histograms; dGSLM and RTTL-DG closely match empirical human statistics (mean gap ≈153 ms and 393 ms for dGSLM and RTTL-DG, respectively).
  • Natural-Dialogue Rates: Speaking rate (words/minute), laughter per minute, filler-word rate (dGSLM: 212 wpm, 3.6 lpm, 5.5% vs. ground-truth 181 wpm, 3.6 lpm, 7.3%).
  • Semantic Coherence: State-of-the-art textless models lag behind text-based cascades in transcript perplexity and coherence (e.g., dGSLM perplexity >150 vs. cascaded ~32; RTTL-DG coherence 4.8/10 vs. 6.4/10 for text-cascaded).
  • Human Ratings: Mean Opinion Scores for naturalness and meaningfulness; dGSLM achieves high naturalness ratings for turn-taking (N-MOS ≈3.7/5), outperforming cascaded systems in this respect.
  • Full Duplex Interaction: RTTL-DG realizes authentic overlaps (5.7/min vs. 0 for cascaded, 4.3 for human) and back-channels, reducing average turn-switch latency below 400 ms (Nguyen et al., 2022, Mai et al., 8 Jan 2025).

| Metric                  | Cascaded | dGSLM | RTTL-DG | Human |
|-------------------------|----------|-------|---------|-------|
| Overlaps/min            | 0.0      | n/a   | 5.7     | 4.3   |
| Avg gap (ms)            | 800      | 153   | 393     | 518   |
| Laughter/min            | 0.0      | 3.6   | 0.22    | 2.02  |
| N-MOS (naturalness, /5) | 2.4      | 3.7   | n/a     | 4.2   |
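
Turn-taking statistics of this kind can be computed from per-speaker voice-activity segments. The sketch below shows illustrative implementations of overlaps-per-minute and mean floor-transfer gap, assuming segment boundaries (start, end in seconds) have already been extracted, e.g., by a VAD; the exact metric definitions in the cited papers may differ.

```python
from typing import List, Tuple

Segment = Tuple[float, float]   # (start_sec, end_sec) of one speaking turn

def overlap_per_minute(a: List[Segment], b: List[Segment], total_sec: float) -> float:
    """Count segment pairs from the two speakers that overlap in time."""
    overlaps = sum(
        1 for (s1, e1) in a for (s2, e2) in b if min(e1, e2) > max(s1, s2)
    )
    return 60.0 * overlaps / total_sec

def mean_gap_ms(a: List[Segment], b: List[Segment]) -> float:
    """Mean silence between one speaker's turn end and the other's next turn start."""
    turns = sorted([(s, e, "A") for s, e in a] + [(s, e, "B") for s, e in b])
    gaps = [
        (turns[i + 1][0] - turns[i][1]) * 1000.0
        for i in range(len(turns) - 1)
        if turns[i + 1][2] != turns[i][2] and turns[i + 1][0] > turns[i][1]
    ]
    return sum(gaps) / max(len(gaps), 1)

spk_a = [(0.0, 2.1), (4.0, 5.5)]
spk_b = [(2.3, 3.9), (5.4, 7.0)]          # second turn overlaps speaker A
print(overlap_per_minute(spk_a, spk_b, 7.0), mean_gap_ms(spk_a, spk_b))
```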

6. Analysis of Generated Dialogues

Generation output is characterized by:

  • Emergent Paralinguistics: Laughter, hesitations, and back-channels appear as discrete symbols in unit sequences, producing realistic overlaps, chuckles, “mhm”, etc.
  • Spectrotemporal Fidelity: HiFi-GAN reconstructions reproduce accurate pitch, intensity, and timing patterns, including the counterintuitive, human-like pattern of within-speaker pauses being longer than between-speaker gaps.
  • Self-Consistency: Models reproduce human dialogue statistics in the absence of explicit text, a property not matched by ASR-LM-TTS pipelines (Nguyen et al., 2022, Mai et al., 8 Jan 2025).

Qualitative comparisons indicate that, given similar conversational prompts, RTTL-DG initiates responsive back-channels and overlapping interjections with much lower latency than cascaded systems, leading to a perceptibly more human-like conversational rhythm.

7. Current Limitations and Prospects

While textless models excel at turn-taking, paralinguistic control, and conversational timing, their semantic coherence lags behind powerful text-based LLMs—primarily due to the limited scale and abstraction of speech-unit representations and the smaller volume of accessible training data relative to text. However, large-scale synthetic pretraining and richer acoustic unit discovery can partially mitigate this gap; for example, pretraining RTTL-DG on 5,798 hours of synthetic data improves next-action accuracy and coherence (action accuracy 83%→85%; coherence 4.8→5.2) (Mai et al., 8 Jan 2025).

A plausible implication is that, with continued increases in unlabeled speech-data scale and improved discrete acoustic modeling, textless dialogue generation may approach or exceed text-supervised models in both surface naturalness and communicative richness. Further research is needed to close the semantic gap and optimize the learnability and controllability of discovered speech units.
