FunAudioLLM: Advancing Natural Voice Interaction Between Humans and LLMs
This report details FunAudioLLM, a suite of models designed to enable natural voice interaction between humans and LLMs. Developed by Alibaba Group's Tongyi SpeechTeam, FunAudioLLM comprises two models: SenseVoice for speech recognition and understanding, and CosyVoice for speech generation. Together they support applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration. Both models, along with their training and inference code, have been open-sourced, underscoring a commitment to advancing voice interaction technology.
Core Technological Innovations
SenseVoice: Advanced Voice Understanding
SenseVoice is introduced as a state-of-the-art voice understanding model available in two variants: SenseVoice-Small and SenseVoice-Large. SenseVoice-Small supports multilingual recognition in five languages with very low latency, running substantially faster than existing alternatives such as Whisper. SenseVoice-Large extends coverage to more than 50 languages and delivers high-precision automatic speech recognition (ASR) while also excelling at emotion recognition and audio event detection. Together, the two variants enable robust, real-time applications that combine multilingual ASR with richer paralinguistic understanding.
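For concreteness, here is a minimal usage sketch of SenseVoice-Small through the FunASR `AutoModel` interface that the open-source release builds on; the model identifier and keyword arguments shown are assumptions and should be verified against the published repository.

```python
from funasr import AutoModel

# Load the open-sourced SenseVoice-Small checkpoint
# (model identifier assumed; confirm against the SenseVoice repository).
model = AutoModel(model="iic/SenseVoiceSmall", device="cuda:0")

# A single call returns a rich transcription: recognized text interleaved with
# language, emotion, and audio-event tags that downstream code can parse.
result = model.generate(input="example.wav", language="auto", use_itn=True)
print(result[0]["text"])
```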
CosyVoice: High-Quality Voice Generation
CosyVoice focuses on generating natural-sounding speech in multiple languages, trained on a large corpus spanning five languages. It excels at zero-shot learning, cross-lingual voice cloning, and instruction following, offering extensive flexibility in producing contextually appropriate and emotionally expressive speech. The release includes three models: CosyVoice-base-300M, CosyVoice-instruct-300M, and CosyVoice-sft-300M, each tailored to specific generative and instruction-following tasks.
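As an illustrative sketch of the generation side, the open-source CosyVoice package exposes a Python class with per-task inference methods; the checkpoint path, method name, return format, and sample rates below are assumptions based on the public release and should be checked against the repository.

```python
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice
from cosyvoice.utils.file_utils import load_wav

# Zero-shot voice cloning with the base model (checkpoint path assumed).
cosyvoice = CosyVoice("pretrained_models/CosyVoice-300M")

# A short reference clip of the target speaker plus its transcript serve as the prompt.
prompt_speech = load_wav("speaker_prompt.wav", 16000)
out = cosyvoice.inference_zero_shot(
    "Hello! This sentence is synthesized in the cloned voice.",
    "Transcript of the reference clip.",
    prompt_speech,
)
torchaudio.save("zero_shot.wav", out["tts_speech"], 22050)
```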
Key Capabilities and Results
Multilingual Speech Recognition
The multilingual speech recognition capabilities of SenseVoice are evaluated on standard benchmarks including AISHELL-1, AISHELL-2, WenetSpeech, and LibriSpeech. In these comparisons, SenseVoice-Small and SenseVoice-Large consistently outperform Whisper models, with particularly strong gains on under-resourced languages. SenseVoice-Small also achieves much lower inference latency, running roughly 5 to 15 times faster than Whisper models of comparable size.
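Latency comparisons of this kind are typically reported as a real-time factor (RTF): processing time divided by audio duration. The sketch below shows one way to measure it; `transcribe` is a placeholder for any ASR call (e.g. a SenseVoice or Whisper wrapper), not an API from the report.

```python
import time

def real_time_factor(transcribe, audio_paths, durations_sec):
    """Return RTF = total processing time / total audio duration (lower is faster)."""
    start = time.perf_counter()
    for path in audio_paths:
        transcribe(path)  # run ASR on one file; the output is ignored for timing
    elapsed = time.perf_counter() - start
    return elapsed / sum(durations_sec)

# Example comparison on a shared evaluation set (callables are placeholders):
# rtf_sensevoice = real_time_factor(sensevoice_asr, files, durations)
# rtf_whisper    = real_time_factor(whisper_asr, files, durations)
# speedup        = rtf_whisper / rtf_sensevoice
```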
Emotion Recognition (SER)
Evaluations on multiple speech emotion datasets showcase SenseVoice's strong emotion recognition capabilities. SenseVoice-Large achieves state-of-the-art performance, clearly surpassing existing models such as XLSR-SER and SALMONN, which indicates that SenseVoice generalizes robustly to diverse emotional states in speech.
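SER results on such datasets are commonly reported as weighted accuracy (overall accuracy) and unweighted accuracy (mean per-class recall). A small sketch of how these two scores can be computed with scikit-learn, using made-up labels purely for illustration:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

def ser_scores(y_true, y_pred):
    """Weighted accuracy (WA) and unweighted accuracy (UA, robust to class imbalance)."""
    return {
        "WA": accuracy_score(y_true, y_pred),
        "UA": balanced_accuracy_score(y_true, y_pred),  # mean per-class recall
    }

# Toy example with illustrative emotion labels (not data from the report):
print(ser_scores(["happy", "sad", "angry", "sad"],
                 ["happy", "sad", "sad", "sad"]))
```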
Audio Event Detection
SenseVoice's audio event detection is benchmarked against dedicated models such as BEATs and PANNs. Although these specialized models occasionally score higher, SenseVoice achieves competitive F1 scores while handling ASR, emotion recognition, and audio event detection within a single model, underscoring its versatility across voice processing tasks.
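Audio event detection is a multi-label task, so F1 is computed over binary indicator vectors of the predicted events. A toy sketch of the metric (labels and values are illustrative, not results from the report):

```python
from sklearn.metrics import f1_score

# Each row is one clip; columns are event classes (e.g. laughter, applause, music).
y_true = [[1, 0, 0], [0, 1, 1], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [0, 1, 1]]

print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
```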
Theoretical and Practical Implications
The development of FunAudioLLM has broad theoretical and practical implications. Theoretically, it underscores the importance of integrated voice understanding and generation frameworks in enhancing human-computer interaction. The combination of high-precision ASR, emotion recognition, intuitive voice generation, and instructional control advances the frontier of natural language processing.
Practically, FunAudioLLM paves the way for numerous applications:
- Speech-to-Speech Translation: Enabling cross-lingual voice communication while preserving the speaker's own voice (see the pipeline sketch after this list).
- Emotional Voice Chat: Facilitating emotionally aware voice interactions for more human-like machine responses.
- Interactive Podcasts: Allowing real-time, dynamic interactions within podcasting that include multiple characters or participants.
- Expressive Audiobook Narration: Delivering nuanced and emotionally resonant audiobook experiences.
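To illustrate how the understanding and generation models compose in the speech-to-speech translation use case, here is a hypothetical glue function; `asr`, `translate`, and `tts` are placeholder callables standing in for SenseVoice recognition, an LLM translation request, and CosyVoice cross-lingual synthesis, not APIs defined by FunAudioLLM.

```python
def speech_to_speech_translate(audio_path, asr, translate, tts, target_lang="en"):
    """Chain the three stages of a speech-to-speech translation pipeline."""
    source_text = asr(audio_path)                      # 1. recognize the source speech
    target_text = translate(source_text, target_lang)  # 2. translate the text with an LLM
    # 3. synthesize in the target language, using the source audio as the voice prompt
    return tts(target_text, prompt_speech=audio_path)
```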
Future Directions
Future work could extend language coverage, strengthen real-time processing, and move toward end-to-end training of combined voice understanding and generation models to reduce error propagation between components. Improving the models' ability to infer emotion and speaking style from contextual semantic content, and refining the expressiveness of the generated speech, would further increase the naturalness of human-LLM interaction.
In conclusion, FunAudioLLM represents a significant advancement in voice interaction technologies, combining robust understanding and generative models to facilitate more natural, seamless, and emotionally aware interactions between humans and LLMs. Through its open-source approach, it invites further exploration and refinement, promising continued progress in the field of AI-driven voice interactions.