LLaMA-Omni: Seamless Speech Interaction with LLMs
The paper "LLaMA-Omni: Seamless Speech Interaction with LLMs" addresses a critical gap in the domain of LLMs by proposing an innovative model architecture named LLaMA-Omni. This model is designed to facilitate low-latency, high-quality speech interactions by seamlessly integrating various speech processing components with an LLM.
Model Architecture
LLaMA-Omni comprises four essential components:
- Speech Encoder: The model employs the Whisper-large-v3 encoder to extract representations from the speech input. The encoder remains frozen throughout training.
- Speech Adaptor: A trainable adaptor maps the encoder's output into the LLM's embedding space by downsampling the speech representation sequence and passing it through a two-layer perceptron (see the sketch after this list).
- LLM: The architecture is built on Llama-3.1-8B-Instruct, chosen for its strong conversational and reasoning capabilities. The LLM consumes the adapted speech representations and generates the text response directly.
- Streaming Speech Decoder: A non-autoregressive streaming Transformer maps the LLM's output hidden states to sequences of discrete speech units, which a vocoder converts to waveform. Because it operates on hidden states as they are produced, speech can be synthesized while the text response is still being generated (a decoder sketch follows below).
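The adaptor referenced above is simple enough to sketch. Below is a minimal, hypothetical PyTorch rendering that assumes frame concatenation as the downsampling step; the dimensions (1280 for Whisper-large-v3, 4096 for Llama-3.1-8B) and the factor k=5 are illustrative choices rather than confirmed hyperparameters.

```python
import torch
import torch.nn as nn

class SpeechAdaptor(nn.Module):
    """Hypothetical speech adaptor: concatenate every k consecutive encoder
    frames (downsampling the sequence by k), then project the result into
    the LLM embedding space with a two-layer perceptron."""

    def __init__(self, enc_dim=1280, llm_dim=4096, k=5):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim * k, llm_dim),
            nn.ReLU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, h):                       # h: (batch, T, enc_dim)
        b, t, d = h.shape
        t = t - t % self.k                      # drop frames that do not fill a group
        h = h[:, :t].reshape(b, t // self.k, d * self.k)  # downsample by k
        return self.mlp(h)                      # (batch, T // k, llm_dim)
```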
Because speech is decoded directly from the LLM's hidden states, the design avoids waiting on an intermediate text transcription, which reduces latency while preserving response quality.
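As for the decoder referenced in the list above, here is a hypothetical sketch of a non-autoregressive unit decoder with CTC-style greedy collapsing. It substitutes a standard Transformer encoder for a true streaming attention mask, and the upsampling factor, unit vocabulary size, and layer count are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class StreamingUnitDecoder(nn.Module):
    """Hypothetical unit decoder: upsample LLM hidden states in time, apply
    non-autoregressive Transformer layers, and predict per-frame logits over
    discrete speech units plus a CTC blank (index 0)."""

    def __init__(self, llm_dim=4096, n_units=1000, upsample=25, n_layers=2):
        super().__init__()
        self.upsample = upsample
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8,
                                           batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(llm_dim, n_units + 1)   # +1 for the CTC blank

    def forward(self, h):                              # h: (batch, T, llm_dim)
        h = h.repeat_interleave(self.upsample, dim=1)  # upsample in time
        return self.head(self.body(h))                 # per-frame unit logits

def ctc_greedy_collapse(logits, blank=0):
    """Greedy CTC decoding for a single example: argmax each frame, merge
    consecutive repeats, drop blanks; the result feeds a unit vocoder."""
    ids = logits.argmax(dim=-1).squeeze(0).tolist()
    units, prev = [], blank
    for i in ids:
        if i != blank and i != prev:
            units.append(i)
        prev = i
    return units
```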
Dataset Construction
To align the model with realistic speech interaction scenarios, the authors construct InstructS2S-200K, a dataset of 200K speech instructions paired with speech responses. Each pair is produced by rewriting a text instruction into a form suited to spoken delivery, generating a matching text response, and synthesizing both to audio (sketched below).
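A minimal sketch of that three-step pipeline might look as follows; the helper callables (rewrite_llm, response_llm, tts) are hypothetical placeholders for whatever instruction-rewriting model, response model, and TTS system one uses, not the authors' actual tooling.

```python
def build_speech_pair(text_instruction, rewrite_llm, response_llm, tts):
    """Hypothetical construction of one InstructS2S-200K-style training pair."""
    # Step 1: rewrite the instruction into a speech-suitable form
    # (conversational phrasing, numbers and symbols spelled out).
    spoken_instruction = rewrite_llm(
        f"Rewrite this instruction as natural speech: {text_instruction}")
    # Step 2: generate a concise response that reads well when spoken aloud.
    response = response_llm(spoken_instruction)
    # Step 3: synthesize both sides to audio waveforms.
    return tts(spoken_instruction), tts(response)
```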
Key Results and Metrics
The paper reports several results in support of these claims:
- ChatGPT Score: For speech-to-text instruction-following (S2TIF) and speech-to-speech instruction-following (S2SIF) tasks, LLaMA-Omni achieved higher content and style scores compared to previous models like SpeechGPT, SALMONN, and Qwen2-Audio.
- Response Latency: LLaMA-Omni achieved the lowest response latency among the compared systems, as low as 226 ms, below GPT-4o's reported average audio latency of 320 ms.
- Speech-Text Alignment: The model exhibited the lowest ASR-WER (11.61) and ASR-CER (7.59) scores, obtained by transcribing the generated speech with an ASR model and comparing the transcript against the generated text response, indicating close alignment between the two output modalities (see the example after this list).
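As an illustration of how such an alignment metric can be computed, the snippet below transcribes a generated audio file and scores it against the model's text output using the open-source jiwer package; the choice of Whisper as the ASR model, the file name, and the text response are assumptions, not the paper's exact evaluation setup.

```python
import jiwer     # pip install jiwer
import whisper   # pip install openai-whisper

# Transcribe the model's generated speech with an off-the-shelf ASR model.
asr = whisper.load_model("large-v3")
transcript = asr.transcribe("generated_response.wav")["text"]  # hypothetical file

# Compare against the text response the model produced alongside the speech.
text_response = "The capital of France is Paris."  # placeholder text output

print(f"ASR-WER: {jiwer.wer(text_response, transcript):.2%}")
print(f"ASR-CER: {jiwer.cer(text_response, transcript):.2%}")
```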
Implications
Practical Implications: The low response latency and high response quality make LLaMA-Omni well suited to applications demanding real-time interaction, such as virtual assistants and conversational agents. Training is also efficient, completing in under three days on four GPUs, which lowers the barrier to building powerful speech interaction models on top of advanced LLMs.
Theoretical Implications: The model demonstrates the feasibility of integrating speech processing components with LLMs to achieve seamless, high-quality speech interaction, setting a precedent for future research on multimodal integration within LLM frameworks.
Future Directions
Several future research directions could further enhance LLaMA-Omni or explore adjacent areas:
- Expressiveness of Speech Responses: Future work could focus on improving the expressiveness and naturalness of the generated speech, for example by modeling richer prosodic features.
- Real-Time Interaction Capabilities: Optimization strategies for further reducing latency could be explored, enabling even smoother real-time interactions.
- Generalization Across Languages: Expanding the model's capabilities to support multiple languages could improve its applicability in global contexts.
Conclusion
LLaMA-Omni introduces a model architecture for seamless, efficient speech interaction with LLMs. Its design, combining a speech encoder, adaptor, LLM, and streaming speech decoder, demonstrates substantial improvements in response latency, response quality, and alignment between speech and text. The implications are significant both for practical real-time applications and for theoretical advances in multimodal LLM integration.