
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming (2408.16725v3)

Published 29 Aug 2024 in cs.AI, cs.CL, cs.HC, cs.LG, cs.SD, and eess.AS

Abstract: Recent advances in LLMs have achieved significant progress. GPT-4o, as a new milestone, has enabled real-time conversations with humans, demonstrating near-human natural fluency. Such human-computer interaction necessitates models with the capability to perform reasoning directly with the audio modality and generate output in streaming. However, this remains beyond the reach of current academic models, as they typically depend on extra TTS systems for speech synthesis, resulting in undesirable latency. This paper introduces the Mini-Omni, an audio-based end-to-end conversational model, capable of real-time speech interaction. To achieve this capability, we propose a text-instructed speech generation method, along with batch-parallel strategies during inference to further boost the performance. Our method also helps to retain the original model's language capabilities with minimal degradation, enabling other works to establish real-time interaction capabilities. We call this training method "Any Model Can Talk". We also introduce the VoiceAssistant-400K dataset to fine-tune models optimized for speech output. To our best knowledge, Mini-Omni is the first fully end-to-end, open-source model for real-time speech interaction, offering valuable potential for future research.

Overview of "Mini-Omni: LLMs Can Hear, Talk While Thinking in Streaming"

"Mini-Omni: LLMs Can Hear, Talk While Thinking in Streaming" primarily addresses a critical gap in the capabilities of LLMs: real-time voice interaction. While models such as GPT-4o have pioneered multimodal understanding and capabilities, this paper introduces Mini-Omni, the first open-source, end-to-end conversational model capable of real-time speech interaction without significant latency.

The central innovation is a text-instructed speech generation method that lets the model produce text and audio tokens in parallel, improving efficiency while preserving its language capabilities. This is augmented by batch-parallel strategies that further bolster the model's reasoning quality during streaming inference.

Key Contributions

  1. Mini-Omni Model:
    • The model is designed as an end-to-end system integrating speech recognition (ASR), text generation, and text-to-speech (TTS) functionalities.
    • Utilizes off-the-shelf methods to discretize speech into tokens and keeps the model architecture deliberately simple, so the approach can be adapted broadly.
  2. Text-Instructed Parallel Generation:
    • The proposed text-instructed speech generation enables the simultaneous production of text and audio tokens with minimal data, preserving the text reasoning capabilities of LLMs.
  3. Any Model Can Talk Method:
    • This approach aims to give existing text models speech capabilities with minimal additional data and modification. It involves a three-phase training process comprising modality alignment, adaptation training, and multimodal fine-tuning (a sketch of this schedule appears after this list).
  4. VoiceAssistant-400K Dataset:
    • Introduction of a new dataset specifically generated by GPT-4o for training speech models, aiding in fine-tuning models to improve spoken dialogue capabilities.
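
The three-phase schedule can be pictured as a freeze/unfreeze pattern over the model's components. The sketch below is a hypothetical PyTorch-style illustration: the module names (llm_core, audio_adapter, audio_head) are placeholders rather than the paper's actual classes, and the freeze pattern shown is one plausible reading of the three phases, not a verbatim reproduction of the paper's recipe.

    import torch

    def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
        for p in module.parameters():
            p.requires_grad = trainable

    def configure_phase(model, phase: str):
        """Freeze/unfreeze sub-modules for one phase of 'Any Model Can Talk'."""
        if phase == "modality_alignment":
            # Phase 1: train only the new audio adapter and audio output head;
            # the pretrained language model core stays frozen.
            set_trainable(model.llm_core, False)
            set_trainable(model.audio_adapter, True)
            set_trainable(model.audio_head, True)
        elif phase == "adaptation_training":
            # Phase 2: freeze the adapters and train the language model to
            # answer in text given speech (and text) inputs.
            set_trainable(model.llm_core, True)
            set_trainable(model.audio_adapter, False)
            set_trainable(model.audio_head, False)
        elif phase == "multimodal_finetune":
            # Phase 3: unfreeze everything for joint text-and-audio fine-tuning.
            set_trainable(model.llm_core, True)
            set_trainable(model.audio_adapter, True)
            set_trainable(model.audio_head, True)
        else:
            raise ValueError(f"unknown phase: {phase}")
        return model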

Strong Numerical Results

The experiments conducted indicate robust performance in foundational tasks. Key results include:

  • Speech Recognition (ASR):
    • Mini-Omni demonstrated competitive performance in speech recognition tasks, yielding a Word Error Rate (WER) of 4.5% on the LibriSpeech test-clean set, comparable to established models like Whisper-small.
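
For reference, Word Error Rate is the word-level edit distance (substitutions, insertions, and deletions) between the hypothesis and the reference transcript, divided by the number of reference words. A minimal, self-contained implementation for sanity-checking reported figures like the 4.5% above:

    def word_error_rate(reference: str, hypothesis: str) -> float:
        """Word-level Levenshtein distance divided by reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edit distance between ref[:i] and hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
        return dp[len(ref)][len(hyp)] / max(len(ref), 1)

    # One substitution over four reference words -> WER = 0.25
    print(word_error_rate("turn the lights off", "turn the light off"))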

Methodologies and Implementation

The model leverages a novel framework for simultaneous text and audio token generation, with specific strategies such as:

  • Parallel Decoding Strategy:
    • Diverging from traditional sequential processing, the model employs parallel generation of audio tokens, which are conditioned on corresponding text tokens. This effectively mitigates the latency issues observed in previous models requiring sequential text-to-audio transformations.
  • Text-Delay Parallel Decoding:
    • Delays audio-token generation relative to the corresponding text tokens so that audio output is conditioned on text already produced, while keeping dialogue responses streaming in real time (a toy layout illustrating this delay pattern appears after this list).
  • Batch Parallel Decoding:
    • Introduces a batch approach where text-only responses are fused into the audio generation process to retain high reasoning capabilities in audio responses.
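
To make the text-delay idea concrete, the toy sketch below arranges one text stream and several audio codebook streams so that each audio layer starts a fixed number of steps after the text stream, padding positions where a stream is inactive. The layer count and delay values are illustrative only; the paper's exact codebook configuration and delay pattern may differ.

    PAD = "<pad>"

    def delayed_layout(text_tokens, audio_layers, delay=1):
        """Return per-step tuples (text, audio_0, audio_1, ...) in which audio
        layer k begins (k + 1) * delay steps after the text stream starts."""
        offsets = [(k + 1) * delay for k in range(len(audio_layers))]
        total = max([len(text_tokens)] +
                    [off + len(layer) for off, layer in zip(offsets, audio_layers)])
        steps = []
        for t in range(total):
            text = text_tokens[t] if t < len(text_tokens) else PAD
            audio = tuple(
                layer[t - off] if 0 <= t - off < len(layer) else PAD
                for off, layer in zip(offsets, audio_layers)
            )
            steps.append((text,) + audio)
        return steps

    # Toy example: 3 text tokens and 2 audio codebook layers of 3 tokens each.
    for step in delayed_layout(["T0", "T1", "T2"],
                               [["a0", "a1", "a2"], ["b0", "b1", "b2"]]):
        print(step)

In the batch-parallel variant described above, one can think of two such streams decoded side by side: one producing only the text channel and one producing audio, with the text-only stream's output fed into the text channel that conditions the audio tokens.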

Theoretical and Practical Implications

The implications of Mini-Omni's design are significant for both theory and practice. Theoretically, the text-instructed parallel generation paradigm challenges conventional sequential processing models, suggesting that parallel decoding might be a viable alternative for multimodal generative models.

Practically, the introduction of the Mini-Omni model and the "Any Model Can Talk" framework opens up new possibilities for real-time conversational AI systems. These methods drastically reduce the complexity and resource demand traditionally associated with training models for multimodal outputs.

Speculations on Future Developments

Looking forward, several future directions are apparent:

  • Enhanced Modality Alignment:
    • Future research could explore more sophisticated alignment techniques that further reduce the computational overhead and improve the fidelity of cross-modal interactions.
  • Advanced Datasets for Multimodal Training:
    • Creating more comprehensive datasets that blend various modalities seamlessly could amplify the capabilities of models like Mini-Omni, enabling richer and more nuanced interactions.
  • Integration with More Robust LLMs:
    • Leveraging larger and more advanced pre-trained LLMs could push the boundaries of what is achievable with the proposed methods, translating to even better performance in real-world applications.

In conclusion, the "Mini-Omni" paper marks a significant step towards achieving seamless, real-time multimodal interactions. By addressing latency issues and proposing a flexible, efficient training framework, this paper sets the stage for future innovations in conversational AI and multimodal language modeling.

Authors (2)
  1. Zhifei Xie (4 papers)
  2. Changqiao Wu (3 papers)
Citations (14)