- The paper introduces LLaMA-Omni 2, a unified SpeechLM architecture integrating speech encoding, language processing, and autoregressive streaming synthesis to reduce errors and latency in spoken chatbots.
- LLaMA-Omni 2 uses a "Read-R-Write-W" strategy for autoregressive streaming speech synthesis, in which the speech decoder writes W speech tokens after every R text tokens the LLM produces, achieving response latency of around 600 ms, significantly lower than previous state-of-the-art models.
- Evaluations show LLaMA-Omni 2 surpasses existing models like GLM-4-Voice in accuracy on spoken QA and instruction following tasks, demonstrating efficient performance with limited training data.
An Overview of LLaMA-Omni 2: LLM-based Real-time Spoken Chatbot
The paper "LLaMA-Omni 2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis" presents a novel approach to building intelligent spoken chatbots on top of LLMs. It introduces LLaMA-Omni 2, a series of modular speech language models (SpeechLMs) capable of high-quality real-time speech interaction. By integrating a speech encoder and an autoregressive streaming speech decoder around the LLM, LLaMA-Omni 2 addresses limitations inherent in traditional cascaded pipelines, such as errors accumulated across stages and high response latency.
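To make the unified architecture concrete, the following is a minimal structural sketch of the speech-in, speech-out chain the paper describes (speech encoder, LLM, streaming speech decoder). All class names, shapes, and return values here are illustrative assumptions, not LLaMA-Omni 2's actual code:

```python
# Structural sketch (assumed names/shapes) of a modular SpeechLM pipeline:
# speech encoder -> LLM -> streaming speech decoder, in one framework rather
# than a cascaded ASR -> LLM -> TTS pipeline with lossy text handoffs.
from dataclasses import dataclass
from typing import List


@dataclass
class SpeechEncoder:
    """Maps raw audio samples to a shorter sequence of speech features."""

    def encode(self, audio: List[float]) -> List[float]:
        # Placeholder: a real encoder (e.g. Whisper-style) would produce
        # feature vectors; here we crudely downsample by 160x as a stand-in.
        return audio[::160]


@dataclass
class SpeechLM:
    """LLM that consumes speech features and generates response text tokens."""

    def generate(self, features: List[float]) -> List[str]:
        return ["hello", "world"]  # stand-in for autoregressive decoding


@dataclass
class StreamingSpeechDecoder:
    """Autoregressively turns text tokens into speech tokens for a vocoder."""

    def synthesize(self, tokens: List[str]) -> List[int]:
        return list(range(len(tokens) * 5))  # stand-in speech-token stream


def respond(audio: List[float]) -> List[int]:
    """End-to-end pass: one model chain, no intermediate-text cascade errors."""
    features = SpeechEncoder().encode(audio)
    text = SpeechLM().generate(features)
    return StreamingSpeechDecoder().synthesize(text)


speech_out = respond([0.0] * 16000)  # 1 s of dummy 16 kHz audio
```

The point of the sketch is the single forward chain: because the decoder conditions directly on the LLM's output rather than on a separately transcribed text, errors from one stage are not silently baked into the next.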
Key Contributions
The paper presents LLaMA-Omni 2, modeled upon the Qwen2.5 series with parameters ranging from 0.5B to 14B, highlighting several key contributions and claims that underscore its proposed methodology:
- Unified Model Architecture: LLaMA-Omni 2 leverages a unified model architecture that combines speech encoding, LLM processing, and autoregressive speech decoding in one framework. It offers a streamlined process from receiving speech input to generating speech output, thereby reducing errors typically seen across multiple model stages in traditional approaches.
- Streaming Speech Synthesis: The autoregressive streaming method ensures synchronized production of text and speech with significantly reduced latency. The "Read-R-Write-W" strategy drives this reduction: the speech decoder writes W speech tokens each time the LLM has produced R new text tokens, so audio playback can begin well before the full text response is complete, giving a competitive edge in real-time speech generation over previous state-of-the-art models.
- Performance Metrics: The paper compares LLaMA-Omni 2 against existing models on benchmarks for spoken question answering and speech instruction following. The results indicate that LLaMA-Omni 2 surpasses earlier models such as GLM-4-Voice, achieving higher accuracy in understanding and generating speech responses with lower response latency, around 600 ms.
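The interleaving behind the "Read-R-Write-W" strategy can be sketched as a simple schedule. This is a hedged illustration under assumed token streams and R/W values; the real system conditions a streaming TTS decoder on the buffered text and LLM hidden states rather than emitting placeholder strings:

```python
# Sketch of a Read-R / Write-W interleaving schedule (values are illustrative):
# after every R text tokens produced by the LLM, the speech decoder emits W
# speech tokens, so audio starts playing before the text response finishes.
from typing import Iterator, List, Tuple


def read_write_schedule(
    text_tokens: List[str], r: int = 3, w: int = 2
) -> Iterator[Tuple[str, str]]:
    """Yield ('text', tok) and ('speech', tok) events in Read-R/Write-W order."""
    buffer: List[str] = []
    for tok in text_tokens:
        buffer.append(tok)
        yield ("text", tok)
        if len(buffer) % r == 0:
            # A real streaming TTS decoder would synthesize w speech tokens
            # conditioned on the r newly buffered text tokens.
            for i in range(w):
                yield ("speech", f"s{len(buffer) // r}_{i}")
    # Flush: synthesize speech for any trailing text shorter than r tokens.
    if len(buffer) % r != 0:
        for i in range(w):
            yield ("speech", f"s_final_{i}")


# 7 text tokens with r=3, w=2: speech chunks follow tokens 3 and 6,
# plus a final flush for the trailing token.
events = list(read_write_schedule(list("abcdefg"), r=3, w=2))
```

The latency win comes from the first `speech` event arriving after only `r` text tokens, rather than after the entire response has been generated.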
Implications and Future Directions
The development of LLaMA-Omni 2 is a considerable advancement in speech-enabled AI models, offering implications for broader applications in human-computer interaction (HCI). By demonstrating strong performance with only 200K multi-turn dialogue samples, it suggests a more data-efficient training process without sacrificing performance, potentially reducing costs related to data collection and processing.
Moreover, the successful integration of LLMs with enhanced speech synthesis capabilities opens avenues for further research into more human-like speech generation, enriched with emotional expressiveness and dialectal variation. Future work may explore these directions, as well as optimizing the model for real-time interaction and paralinguistic comprehension.
Conclusion
LLaMA-Omni 2 represents a significant step forward in real-time spoken chatbot technology, unifying SpeechLM methodologies with autoregressive streaming speech synthesis. Its strong performance across the tested benchmarks, coupled with reduced latency, signals potential for meaningful improvements in interactive AI systems. The paper thus provides a solid foundation for advancing speech-language interfaces, making them more efficient and capable in real-world applications.