Overview of GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot
The paper presents GLM-4-Voice, an innovative end-to-end spoken chatbot model that seeks to enhance the naturalness and expressiveness of voice-based interactions. This work addresses significant challenges in building spoken chatbots that can understand and exhibit nuanced, human-like conversational behavior using speech language modeling techniques. Below is an expert analysis of the methodologies, experimental results, and potential implications outlined in the paper.
Methodology
GLM-4-Voice is designed to capture the subtleties of human speech through several novel methodologies and architectural choices:
- Single-Codebook Supervised Speech Tokenization: The model employs a supervised single-codebook speech tokenizer operating at 12.5 Hz, yielding an ultra-low bitrate of 175 bps (a quick arithmetic check of these figures appears after this list). Unlike prior approaches that require generating multiple layers of tokens per step, this single-codebook design preserves semantic content at minimal bitrate while still enabling high-quality speech reconstruction, which is crucial for maintaining conversational coherence.
- Flow-Matching-Based Speech Decoder: The decoder uses a flow-matching mechanism, paired with a HiFi-GAN vocoder, to convert discrete speech tokens into fluent, natural-sounding audio, which is essential for delivering expressive spoken responses with minimal latency (a minimal sampling sketch follows the list).
- Large-Scale Pre-Training: Starting from the GLM-4-9B text LLM, the paper applies a comprehensive pre-training procedure spanning 1 trillion tokens that combines synthetic interleaved speech-text data, unsupervised speech data, and supervised ASR/TTS data (a sketch of the interleaved data construction appears after this list). This extensive training regime is pivotal in building the model's cross-modal capabilities.
- Fine-Tuning with Conversational and Style-Controlled Datasets: Building on the pre-trained base model, GLM-4-Voice undergoes fine-tuning on datasets designed to bolster conversational quality and control speech style via direct synthesis, thereby tailoring the chatbot's responses to align with user preferences dynamically.
- Optimization for Streaming Inference: The architecture supports low-latency interaction, essential for real-time applications. By leveraging a "streaming thoughts" template, the model interleaves text and speech token generation within a single autoregressive stream, producing timely and contextually relevant utterances (see the streaming-thoughts sketch below).
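As a quick sanity check on the tokenizer figures above, 175 bps at 12.5 tokens per second implies 14 bits of information per token, i.e., a single codebook of 2^14 = 16384 entries, consistent with the single-codebook design:

```python
# Back-of-the-envelope check of the tokenizer's reported numbers
# (12.5 Hz token rate and 175 bps bitrate, both from the paper).
TOKEN_RATE_HZ = 12.5   # speech tokens per second of audio
BITRATE_BPS = 175      # reported bitrate

bits_per_token = BITRATE_BPS / TOKEN_RATE_HZ   # 175 / 12.5 = 14.0
codebook_size = 2 ** round(bits_per_token)     # 2**14 = 16384 entries

print(f"{bits_per_token:.0f} bits per token -> single codebook of {codebook_size} entries")
```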
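To make the decoding step concrete, here is a minimal sketch of flow-matching sampling: a learned velocity field is integrated from Gaussian noise toward an acoustic representation (e.g., a mel spectrogram), which a HiFi-GAN vocoder then converts to a waveform. The `velocity_fn` and `cond` names are placeholders for whatever the real decoder provides, not the paper's actual interfaces:

```python
import torch

def flow_matching_sample(velocity_fn, cond, shape, num_steps=10):
    """Integrate a learned velocity field from Gaussian noise (t=0) toward
    an acoustic target such as a mel spectrogram (t=1) using Euler steps.
    velocity_fn(x, t, cond) stands in for the trained decoder network;
    cond stands in for features derived from the discrete speech tokens."""
    x = torch.randn(shape)                    # start from noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = torch.full((shape[0],), step * dt)
        x = x + velocity_fn(x, t, cond) * dt  # Euler update along the flow
    return x                                  # e.g., a mel spectrogram for HiFi-GAN

# Usage with a dummy velocity field standing in for the trained model:
mel = flow_matching_sample(lambda x, t, c: -x, cond=None, shape=(1, 80, 100))
```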
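The synthetic interleaved speech-text pre-training data can be pictured as follows: given segments of text aligned with their speech tokens (e.g., via TTS plus the tokenizer), each segment is emitted in one modality or the other, so the model must learn to continue coherently across modality switches. This is a rough sketch under stated assumptions (segment-level alternation, a `p_speech` mixing rate); the paper's exact construction may differ:

```python
import random

def make_interleaved_example(aligned_segments, p_speech=0.5, seed=None):
    """Build one synthetic interleaved training sequence from aligned
    (text_tokens, speech_tokens) segment pairs. Segment-level alternation
    and the p_speech mixing rate are illustrative assumptions."""
    rng = random.Random(seed)
    sequence = []
    for text_toks, speech_toks in aligned_segments:
        # Emit each aligned segment in a randomly chosen modality, so the
        # model sees frequent text<->speech transitions during pre-training.
        sequence.extend(speech_toks if rng.random() < p_speech else text_toks)
    return sequence

# Usage with toy segments (speech tokens shown as integers):
pairs = [(["the", "cat"], [901, 77, 412]), (["sat", "down"], [55, 190, 8])]
print(make_interleaved_example(pairs, seed=0))
```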
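Finally, the "streaming thoughts" template can be illustrated as a fixed-ratio interleaving of text and speech tokens in one autoregressive stream, so audio synthesis can begin before the full text response exists. The 13:26 text-to-speech block ratio below is an assumption for illustration; treat the exact schedule as configurable rather than the paper's verbatim specification:

```python
from itertools import islice

def streaming_thoughts(text_tokens, speech_tokens, n_text=13, n_speech=26):
    """Interleave text and speech tokens in fixed-size blocks within one
    autoregressive stream, so audio synthesis can start while the textual
    'thought' is still being generated. Block sizes are assumptions."""
    text_it, speech_it = iter(text_tokens), iter(speech_tokens)
    while True:
        text_block = list(islice(text_it, n_text))
        speech_block = list(islice(speech_it, n_speech))
        if not text_block and not speech_block:
            break
        yield from text_block    # a short burst of text guides...
        yield from speech_block  # ...the speech tokens that follow it

# Usage: flatten toy token streams into one decoding order.
stream = list(streaming_thoughts(range(30), range(60)))
```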
Experimental Results
GLM-4-Voice exhibits strong performance across multiple tasks, surpassing baseline models in speech language modeling, ASR, TTS, and spoken question answering. The experimental evaluations reveal key insights:
- On speech language modeling, GLM-4-Voice demonstrates superior performance in both speech-to-speech (S→S) and speech-to-text (S→T) settings, indicating effective integration of the speech and text modalities.
- In spoken question answering, it shows competitive accuracy in comparison to other state-of-the-art models, particularly when leveraging textual guidance.
- Evaluation of the chat model using metrics such as ChatGPT Score indicates improvement on both general QA and knowledge tasks, further emphasizing the model's conversational adeptness and breadth of encoded knowledge.
- Furthermore, speech quality, assessed via UTMOS (an automatic mean-opinion-score predictor), shows improved naturalness relative to contemporary spoken chatbot implementations.
Implications and Future Directions
The research outlined in this paper has both practical and theoretical implications:
- Practical Implications: The efficient and nuanced speech generation capabilities of GLM-4-Voice make it a viable candidate for deployment in various applications requiring sophisticated vocal human-computer interaction, such as customer service, virtual assistants, and interactive learning platforms.
- Theoretical Advancements: This research enhances our understanding of integrating speech and text modalities within unified AI frameworks, showcasing the benefits of large-scale multi-modal pre-training for voice AI systems.
Future developments could extend the architecture to additional languages and dialects, refine the reproduction of vocal emotion and style, and incorporate diverse real-world conversational data rather than only synthetic or controlled corpora. Additionally, investigating the model's robustness and fairness across different demographic and linguistic settings would be pivotal in advancing its real-world application.
In conclusion, GLM-4-Voice offers a substantial step forward in the domain of spoken chatbots, merging linguistic processing and speech technology to create interfaces that are both intelligent and remarkably human-like. The open-source nature of the model encourages further research and innovation, potentially setting new standards in voice AI systems.