Language Model Can Listen While Speaking
The paper "LLM Can Listen While Speaking" by Ma et al. primarily addresses the limitations of turn-based conversational AI models by proposing a novel approach to real-time human-computer interaction (HCI) in speech LLMs (SLMs). The authors introduce the concept of Full Duplex Modeling (FDM) within interactive speech LLMs (iSLMs), which enables simultaneous listening and speaking during conversations. This paper presents the design and evaluation of an innovative model called the Listening-while-Speaking LLM (LSLM), which demonstrates this capability.
Introduction
The paper begins by highlighting dialogue as a natural mode of HCI and reviews recent advances in speech language models built on large language models (LLMs). Because existing models are turn-based, they cannot handle real-time spoken interaction. To overcome this limitation, the authors propose FDM for iSLMs, equipping models to handle interruptions effectively and thereby improving real-time interaction.
Model Design
The LSLM is introduced as an end-to-end system with both a listening channel and a speaking channel. The speaking channel uses a token-based, decoder-only text-to-speech (TTS) model for speech generation, while the listening channel uses a streaming self-supervised learning (SSL) encoder to process real-time audio input. Fusing the two channels allows the model to generate speech autoregressively while detecting turn-taking in real time.
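To make the two-channel design concrete, the sketch below pairs a streaming listening encoder with a decoder-only speech-token generator in PyTorch. It is a minimal illustration under assumed choices: the causal convolution standing in for the SSL encoder, the extra IRQ (interruption) token, and all dimensions are our assumptions rather than the authors' implementation.

```python
# Minimal sketch of the two LSLM channels; all module choices, names, and
# dimensions are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class StreamingListener(nn.Module):
    """Stand-in for the streaming SSL encoder: a causal convolution that
    produces one listening embedding per input frame."""
    def __init__(self, in_dim=80, d_model=256):
        super().__init__()
        self.proj = nn.Conv1d(in_dim, d_model, kernel_size=3, padding=2)

    def forward(self, feats):                          # feats: (B, T, in_dim)
        x = self.proj(feats.transpose(1, 2))           # pad, convolve over time
        return x[..., :feats.size(1)].transpose(1, 2)  # crop so frame t sees only <= t

class SpeakingDecoder(nn.Module):
    """Decoder-only generator over discrete speech tokens (token-based TTS),
    with one extra token reserved for the turn-taking / interruption decision."""
    def __init__(self, vocab=1024, d_model=256, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab + 1, d_model)  # +1: reserved IRQ token
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab + 1)

    def forward(self, tokens, listen_emb):             # tokens: (B, T); listen_emb: (B, T, d_model)
        x = self.embed(tokens) + listen_emb            # simplest possible channel fusion
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.blocks(x, mask=mask))    # logits over speech tokens + IRQ

listener, speaker = StreamingListener(), SpeakingDecoder()
feats = torch.randn(2, 50, 80)                         # 2 utterances, 50 feature frames
tokens = torch.randint(0, 1024, (2, 50))               # previously generated speech tokens
logits = speaker(tokens, listener(feats))              # (2, 50, 1025); last index = IRQ
```

During generation, emitting the reserved IRQ token would be the model's cue to stop speaking once an interruption is detected.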
Three fusion strategies are explored:
- Early Fusion: Integrates the listening and speaking channels at the input embeddings before autoregressive prediction.
- Middle Fusion: Merges the channels at every Transformer block by adding the listening channel to each block's input.
- Late Fusion: Combines the channels at the output logits before the softmax operation.
Among these strategies, the middle fusion approach is found to be the most effective in balancing the requirements of speech generation and real-time interaction.
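The three strategies differ only in where the listening channel enters the speaking channel's computation. The sketch below expresses them as plain functions over generic blocks and an output head; the names (speak_emb, listen_emb, blocks, head) and the additive combination are illustrative assumptions, not the paper's code.

```python
# Illustrative fusion sketches; names and the additive merge are assumptions
# made for this example, not taken from the paper.
import torch
import torch.nn as nn

def early_fusion(speak_emb, listen_emb, blocks, head):
    """Merge the channels once, at the input embeddings."""
    x = speak_emb + listen_emb
    for block in blocks:
        x = block(x)
    return head(x)

def middle_fusion(speak_emb, listen_emb, blocks, head):
    """Re-inject the listening channel at the input of every Transformer block."""
    x = speak_emb
    for block in blocks:
        x = block(x + listen_emb)
    return head(x)

def late_fusion(speak_logits, listen_logits):
    """Combine the channels only at the output logits, before the softmax."""
    return speak_logits + listen_logits

# Tiny usage example with linear layers standing in for Transformer blocks.
B, T, D, V = 2, 10, 256, 1025
speak_emb, listen_emb = torch.randn(B, T, D), torch.randn(B, T, D)
blocks = [nn.Linear(D, D) for _ in range(4)]
head = nn.Linear(D, V)
logits = middle_fusion(speak_emb, listen_emb, blocks, head)   # shape (B, T, V)
```

Re-injecting the listening channel at every block keeps the interruption signal visible throughout the network while leaving the speech-token embeddings intact, which is one plausible reason the middle strategy balances the two tasks best.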
Experimental Evaluation
The LSLM's performance is evaluated under two experimental settings: command-based FDM and voice-based FDM. The experiments show that the LSLM is robust to noise and can respond to a variety of instructions from unseen speakers. The following key results were reported (an illustrative scoring sketch for the turn-taking metrics follows the list):
- The command-based FDM setting showed that the middle fusion approach achieved a Word Error Rate (WER) of 4.05% in clean conditions and 4.51% in noisy conditions, indicating minimal impact on the speech generation capability.
- For interactive capability, the LSLM with middle fusion achieved high precision (97.80%), recall (98.19%), and F1 score (98.00%) under clean conditions, and performed competitively under noisy conditions.
- The voice-based FDM setting, which involved diverse interruption commands and unseen speakers, further challenged the model. Despite this, the LSLM maintained reasonable performance with a WER of 5.33% in clean conditions and 8.50% in noisy conditions.
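For concreteness, the sketch below shows one way the turn-taking detection metrics (precision, recall, F1) could be scored from per-utterance interruption timestamps. The tolerance window and the decision rule are assumptions made for illustration; the paper's exact evaluation protocol may differ.

```python
# Hedged sketch of scoring interruption detection; the tolerance window and the
# rule that a mistimed stop counts as both a miss and a false alarm are assumptions.
def turn_taking_metrics(predictions, references, tolerance=1.0):
    """predictions / references: one entry per test utterance, giving the time (s)
    at which the model stopped / should have stopped speaking, or None if no
    interruption occurred."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        if ref is None:
            if pred is not None:
                fp += 1            # model stopped although nobody interrupted
        elif pred is not None and abs(pred - ref) <= tolerance:
            tp += 1                # stopped close enough to the real interruption
        else:
            fn += 1                # missed or badly mistimed interruption
            if pred is not None:
                fp += 1            # ...and a mistimed stop also counts as a false alarm
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: three utterances; the third stop is too far from the true interruption.
print(turn_taking_metrics([2.1, None, 5.0], [2.0, None, 3.0]))  # -> (0.5, 0.5, 0.5)
```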
Implications and Future Directions
The introduction of FDM in iSLMs marks a significant step toward real-time spoken interaction in conversational AI. The ability to handle interruptions in real time not only improves the user experience but also broadens the applicability of speech dialogue systems to dynamic and noisy environments.
Future research directions could include integrating audiovisual co-guidance mechanisms to improve turn-taking, enhancing speaker-following capabilities to identify interrupting speakers more accurately, and expanding the duplex modeling to more complex dialogue scenarios involving multiple speakers and more intricate interaction patterns.
Conclusion
The paper by Ma et al. presents a comprehensive approach to overcoming the limitations of turn-based SLMs by introducing full duplex modeling. The proposed LSLM demonstrates the feasibility and effectiveness of simultaneous listening and speaking in speech language models. By achieving strong results on real-time interaction metrics, the paper lays the groundwork for more advanced and user-friendly interactive speech dialogue systems.