MinMo: A Multimodal LLM for Seamless Voice Interaction
The paper "MinMo: A Multimodal LLM for Seamless Voice Interaction" by the FunAudioLLM Team from Alibaba Group introduces MinMo, a multimodal LLM designed to optimize voice interactions through seamless integration of speech and text modalities. This essay provides an expert overview of the key methodologies, results, and implications of this research.
Overview of MinMo
MinMo is a multimodal LLM with approximately 8 billion parameters, developed to address the limitations of earlier speech-text models for seamless voice interaction. Existing models generally fall into two categories: native multimodal models, which struggle with the sequence-length discrepancy between speech and text, and aligned multimodal models, which are limited by the scale of their speech training data and struggle with complex task execution.
Methodology
MinMo adopts a multi-stage training approach to align the speech and text modalities; a minimal structural sketch of the resulting modules follows the list below. The stages are:
- Speech-to-Text Alignment: Utilizing large-scale speech data to align the audio input latent space with a pre-trained text LLM.
- Text-to-Speech Alignment: Developing an Output Projector and Voice Token LM that map the LLM's semantic text representations into the audio output latent space.
- Speech-to-Speech Alignment: Strengthening end-to-end speech-to-speech interaction using substantial paired audio data, enabling nuanced control over speaking style and delivery based on user instructions.
- Duplex Interaction Alignment: Implementing a full-duplex prediction module that decides in real time whether to keep listening or to start or stop speaking, allowing the system to manage simultaneous speaking and listening, including user turn-taking and interruptions.
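
To make the module layout concrete, below is a minimal PyTorch sketch of how these aligned components might be wired together. It is not the authors' implementation: the class name MinMoSketch, the stand-in encoder and LLM blocks, the three-way duplex decision, and all dimensions are assumptions chosen only to illustrate the data flow from audio features, through the text LLM's latent space, to speech tokens and a full-duplex decision.

```python
import torch
import torch.nn as nn


class MinMoSketch(nn.Module):
    """Illustrative wiring of MinMo-style modules; all names and sizes are assumptions."""

    def __init__(self, d_audio=512, d_llm=512, n_speech_tokens=4096):
        super().__init__()
        # Speech-to-text alignment: an audio encoder plus an input projector map
        # audio features into the text LLM's latent space.
        self.speech_encoder = nn.GRU(d_audio, d_audio, batch_first=True)  # stand-in for a pretrained speech encoder
        self.input_projector = nn.Linear(d_audio, d_llm)
        # Stand-in for the pretrained text LLM backbone whose capabilities are preserved.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_llm, nhead=8, batch_first=True), num_layers=2
        )
        # Text-to-speech alignment: an output projector and a voice token LM turn the
        # LLM's hidden states into discrete speech tokens for a downstream synthesizer.
        self.output_projector = nn.Linear(d_llm, d_llm)
        self.voice_token_lm = nn.Linear(d_llm, n_speech_tokens)
        # Full-duplex prediction: a lightweight head deciding, from the latest context,
        # whether to keep listening, start responding, or stop the ongoing response.
        self.duplex_head = nn.Linear(d_llm, 3)

    def forward(self, audio_feats):
        enc, _ = self.speech_encoder(audio_feats)            # (B, T, d_audio)
        hidden = self.llm(self.input_projector(enc))         # (B, T, d_llm)
        speech_token_logits = self.voice_token_lm(self.output_projector(hidden))
        duplex_logits = self.duplex_head(hidden[:, -1])      # decision for the current frame
        return speech_token_logits, duplex_logits


# Example: 50 frames of dummy audio features standing in for encoded speech.
model = MinMoSketch()
speech_token_logits, duplex_logits = model(torch.randn(1, 50, 512))
print(speech_token_logits.shape, duplex_logits.shape)  # torch.Size([1, 50, 4096]) torch.Size([1, 3])
```

In the paper, each training stage aligns a different bridge in this pipeline while the capabilities of the pretrained text LLM are preserved; the sketch collapses everything into a single forward pass purely for illustration.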
Results
MinMo achieves state-of-the-art performance across multiple benchmarks, demonstrating superior capabilities in speech comprehension, generation, and full-duplex interaction. Notable results include:
- Speech Recognition and Translation: MinMo surpasses existing models such as Whisper and Qwen2-Audio on both ASR and multilingual speech translation benchmarks, and it maintains this performance without requiring language identification in the prompt.
- Speech Emotion and Audio Event Recognition: MinMo improves on previous models in understanding complex speech attributes such as emotion and audio events, with notably strong cross-lingual emotion recognition.
- Instruction-Following Voice Generation: The model excels in generating speech that conforms to diverse user instructions regarding style and emotion, achieving high accuracy in instruction adherence.
Implications and Future Developments
MinMo represents a significant advance in voice interaction systems, addressing the sequence-length discrepancy and speech-data limitations of prior approaches while preserving the capabilities of the underlying text LLM. Its seamless handling of full-duplex interaction underscores its potential for real-time applications, paving the way for more sophisticated voice-driven interfaces in AI systems.
Future developments could build upon MinMo's architecture to further reduce latency, expand the range of supported languages and dialects, and enhance instruction-following capabilities through more extensive training and data scaling. MinMo sets a precedent for integrating multimodal capabilities into LLMs, advancing the boundaries of voice interaction systems in AI.