Generative Spoken Dialogue Language Modeling
The paper "Generative Spoken Dialogue LLMing" presents an innovative approach to modeling spoken dialogues using a textless model, dGSLM. Aiming to generate naturalistic spoken dialogues in audio form without relying on textual data or labels, the research leverages developments in self-supervised learning and speech processing. The methodology involves three main components: unsupervised spoken unit discovery, dual-tower transformer architecture, and training on extensive two-channel conversational audio data. This paper examines various aspects including turn-taking, paralinguistic signals, and both content and duration modeling.
The dGSLM model is trained on 2,000 hours of raw two-channel conversational audio from the Fisher dataset. It combines a self-supervised discrete speech representation model with a novel dual-tower transformer architecture that uses cross-attention between the two speaker channels. A discrete encoder (HuBERT features quantized with k-means) turns each channel into a stream of discrete units, a Dialogue Language Model (DLM) models the two unit streams jointly, and a HiFi-GAN vocoder converts generated units back into waveforms, yielding speech that includes laughter and backchannels. The model exhibits improved turn-taking and overlap synchronization compared to text-based pipelines and achieves more naturalistic dialogue dynamics overall, although it scores lower on semantic content than conventional ASR+LM+TTS cascades.
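To make the encoding step concrete, the following is a minimal sketch of how one channel could be mapped to discrete units. It assumes a generic pretrained HuBERT checkpoint from torchaudio and a k-means codebook fitted offline on HuBERT frames; the paper instead trains HuBERT on Fisher and uses its own 500-cluster codebook, so the checkpoint, feature layer, and file name below are illustrative assumptions.

```python
# Sketch of the discrete encoding step; checkpoint, layer choice, and
# codebook path are assumptions standing in for the paper's Fisher-trained
# HuBERT and its 500-cluster k-means codebook.
import torch
import torchaudio
import joblib  # the codebook is assumed saved with joblib after offline fitting

bundle = torchaudio.pipelines.HUBERT_BASE      # 16 kHz input, 20 ms frames
hubert = bundle.get_model().eval()
kmeans = joblib.load("kmeans_500.pkl")         # hypothetical pre-fitted codebook

def encode_channel(waveform: torch.Tensor):
    """Map one audio channel (1, samples) to a frame-aligned unit stream.

    Repeated units are kept rather than deduplicated so the two channels
    stay time-synchronized; repetitions are handled by the edge-unit and
    duration objectives described later.
    """
    with torch.inference_mode():
        feats, _ = hubert.extract_features(waveform, num_layers=6)
        frames = feats[-1].squeeze(0)          # (T, D) frame-level features
    return kmeans.predict(frames.numpy())      # one cluster id per frame
```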
Methodology
The paper’s methodology centers on bypassing the traditional text interface entirely, operating directly on the audio channels for both encoding and generation. This is significant because it addresses an inherent limitation of existing dialogue systems, which rely on rigid turn-taking mechanisms, such as wake words or silence detection, to decide when the machine may speak. In the dual-tower architecture, each speaker channel is processed by its own transformer tower, and the towers exchange information through cross-attention, letting the model synchronize the two channels and produce more fluid turn-taking. Several configurations are compared, including a simpler single-tower baseline, MS-TLM, in which one transformer handles both channels.
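As a rough illustration of the architecture, here is a minimal PyTorch sketch of one dual-tower layer. Layer sizes, normalization placement, and whether cross-attention may see the other channel's current frame are assumptions; the weight sharing between the two towers follows the paper's description.

```python
# Minimal sketch of one dual-tower layer with cross-attention between the
# two speaker channels; dimensions and norm placement are illustrative.
import torch
import torch.nn as nn

class DualTowerLayer(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward_one(self, x, other, causal_mask):
        # each channel attends to its own past, then to the other channel;
        # the same causal mask works because the streams are frame-aligned
        h, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + h)
        h, _ = self.cross_attn(x, other, other, attn_mask=causal_mask)
        x = self.norm2(x + h)
        return self.norm3(x + self.ff(x))

    def forward(self, a, b, causal_mask):
        # one set of weights processes both towers (parameter sharing)
        return self.forward_one(a, b, causal_mask), self.forward_one(b, a, causal_mask)

# usage: two frame-aligned streams of length T under a causal mask
T, D = 100, 512
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
a, b = torch.randn(1, T, D), torch.randn(1, T, D)
a_out, b_out = DualTowerLayer(D)(a, b, mask)
```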
The discrete speech units derived from HuBERT are decoded back to audio with a HiFi-GAN vocoder adapted to unit input, producing vocal output that retains the prosodic and non-verbal elements typical of spontaneous speech. Two training objectives prove particularly important: edge unit prediction, which applies the unit loss only at frames where the unit changes and improves unit modeling, and delayed duration prediction, which predicts each unit's duration one step after the unit itself, as sketched below. Ablations indicate that the cross-attention mechanism and the delayed prediction objective both contribute to more accurate modeling of dialogue content and synchronization.
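Here is a hedged sketch of how these two objectives could be implemented, assuming frame-aligned target streams already shifted for next-step prediction; the tensor names, the L1 duration loss, and the exact masking are illustrative assumptions rather than the paper's code.

```python
# Sketch of the edge-unit and delayed-duration objectives: the unit
# cross-entropy counts only at "edges" (frames where the unit changes),
# and each edge unit's duration is regressed one step later, so the model
# conditions on the unit before predicting how long it lasts.
import torch
import torch.nn.functional as F

def dlm_losses(unit_logits, dur_preds, units, durations, edge_mask):
    """
    unit_logits: (B, T, V) next-unit logits from one tower
    dur_preds:   (B, T)    predicted durations
    units:       (B, T)    target unit ids
    durations:   (B, T)    target durations (frames each edge unit lasts)
    edge_mask:   (B, T)    True where the unit differs from the previous frame
    """
    ce = F.cross_entropy(unit_logits.transpose(1, 2), units, reduction="none")
    unit_loss = (ce * edge_mask).sum() / edge_mask.sum()   # edge-only unit loss

    # delayed duration: the duration of the edge unit at step t is
    # predicted at step t + 1, after the unit itself has been observed
    delayed = edge_mask[:, :-1]
    dur_loss = F.l1_loss(dur_preds[:, 1:][delayed],
                         durations[:, :-1][delayed].float())
    return unit_loss, dur_loss
```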
Evaluation and Results
The paper presents extensive evaluations of both training and generation. During training, negative log-likelihood (NLL) and duration prediction accuracy quantify the impact of the cross-attention layers and the prediction objectives on model performance. For generation, statistical analyses compare model outputs to ground-truth continuations, examining turn-taking events such as Inter-Pausal Units (IPUs), gaps, pauses, and overlaps.
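As a rough illustration, the sketch below shows how such turn-taking events can be measured from per-frame voice-activity flags on each channel; the 20 ms frame size and the event definitions are common conventions standing in for the paper's exact VAD and thresholding choices.

```python
# Turn-taking statistics from two boolean per-frame VAD tracks; the frame
# duration and classification rules below are assumed conventions.
import numpy as np

FRAME_S = 0.02  # assumed 20 ms frames

def speech_spans(vad: np.ndarray):
    """Boolean VAD track -> list of (start, end) frame spans of activity."""
    edges = np.flatnonzero(np.diff(np.r_[0, vad.astype(int), 0]))
    return list(zip(edges[::2], edges[1::2]))

def overlap_seconds(vad_a: np.ndarray, vad_b: np.ndarray) -> float:
    """Total time both speakers are active simultaneously."""
    return float((vad_a & vad_b).sum()) * FRAME_S

def pauses_and_gaps(vad_a: np.ndarray, vad_b: np.ndarray):
    """Classify each joint silence: a pause if the same speaker talks on
    both sides of it, a gap if the floor passes to the other speaker."""
    speaker = np.where(vad_a, 0, np.where(vad_b, 1, -1))   # -1 = joint silence
    pauses, gaps = [], []
    for s, e in speech_spans(speaker == -1):
        before = speaker[:s][speaker[:s] >= 0]
        after = speaker[e:][speaker[e:] >= 0]
        if before.size and after.size:
            (pauses if before[-1] == after[0] else gaps).append((e - s) * FRAME_S)
    return pauses, gaps
```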
Compared to the cascaded baseline, dGSLM reproduces naturalistic dialogue synchronization markedly better, although semantic coherence and meaningfulness remain areas for improvement. Human evaluations confirm the model's ability to mimic natural turn-taking dynamics, despite the semantic gap with ground-truth continuations.
Implications and Future Work
The implications of this research are substantial, both practically and theoretically. Practically, the model paves the way for human-machine dialogue systems with more realistic interactions that incorporate spontaneous feedback signals like laughter and hesitations, elements missing from traditional systems. Theoretically, it shows that models trained solely on the acoustic signal can learn linguistic and paralinguistic features crucial for dialogue coherence.
Future developments could include integrating stronger semantic modeling and training on larger datasets for improved contextual understanding. Incorporating pitch and other prosodic streams alongside the dual-channel input might further enhance syntactic and semantic processing. Ultimately, advancing this minimally supervised approach to language modeling may yield more sophisticated dialogue systems that align more closely with human conversational norms.