FunAudioLLM: Next-Gen Voice Interaction Model
- FunAudioLLM is a modular foundation model for voice interactions, combining advanced ASR, emotion/event recognition, and high-fidelity TTS.
- It leverages zero-shot controllable techniques and in-context voice cloning to deliver expressive, multilingual speech synthesis and real-time dialogue.
- The open-source pipeline supports diverse applications such as interactive translation, audiobook narration, and emotional chatbot communication.
FunAudioLLM is a foundation model family designed for comprehensive, controllable, and expressive voice-based interactions between humans and LLMs. It combines modular speech understanding, emotion and event recognition, multilingual capability, and high-fidelity, zero-shot controllable voice generation with open-source code and model access. The system architecture, inference and training pipelines, and quantitative performance metrics position FunAudioLLM as a central building block for next-generation natural voice interfaces, audiobooks, translation agents, expressive chatbots, and creative speech/music applications (An et al., 2024).
1. System Architecture and Component Models
FunAudioLLM adopts a modular pipeline architecture centered around two main components:
- SenseVoice: A speech-understanding family supporting multilingual automatic speech recognition (ASR), speech emotion recognition (SER), audio event detection (AED), punctuation, and language ID.
- CosyVoice: A speech generation/modeling suite supporting multi-speaker, multi-style, cross-lingual and zero-shot controllable text-to-speech (TTS), paralinguistic voice control, and in-context speaker cloning.
These models interface with a general-purpose LLM (e.g., Qwen or LLaMA-variant) to enable natural dialog, language understanding, and content creation. The canonical inference pipeline is:
- User speech → SenseVoice: recognition, emotion/event tagging.
- Transcribed text + tags → LLM: language understanding, content planning, or translation.
- Generated text + control metadata → CosyVoice: speech synthesis in target language, style, or voice.
This design allows for decoupled advances in understanding, generation, and high-level reasoning, resulting in:
- Real-time or low-latency deployment for voice interaction.
- Flexible integration into downstream applications such as conversational agents, podcast generation, or real-time translation (An et al., 2024).
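The three-stage flow above can be sketched as plain function composition. This is a minimal illustration, not the project's API: the `Understanding` container, the `voice_turn` function, and the three stand-in stages are hypothetical names introduced here so the example runs without any model weights.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Understanding:
    """Output of the speech-understanding stage (hypothetical container)."""
    text: str
    emotion: str
    events: List[str]


def voice_turn(
    audio: bytes,
    understand: Callable[[bytes], Understanding],
    reason: Callable[[str], str],
    synthesize: Callable[[str, str], bytes],
) -> bytes:
    """Run one dialogue turn through the three decoupled stages."""
    # Stage 1: speech -> text plus emotion/event tags (SenseVoice's role).
    heard = understand(audio)
    # Stage 2: tagged transcript -> reply text (the LLM's role).
    events = ", ".join(heard.events) or "none"
    prompt = f"User: {heard.text} (emotion: {heard.emotion}; events: {events})"
    reply = reason(prompt)
    # Stage 3: reply text + control metadata -> waveform (CosyVoice's role).
    return synthesize(reply, heard.emotion)


# Stand-in stages; each could be swapped for a real model independently.
def fake_understand(audio: bytes) -> Understanding:
    return Understanding(text="hello there", emotion="happy", events=["laughter"])


def fake_reason(prompt: str) -> str:
    return f"Echo -> {prompt}"


def fake_synthesize(text: str, style: str) -> bytes:
    return f"[{style}] {text}".encode("utf-8")


waveform = voice_turn(b"\x00\x01", fake_understand, fake_reason, fake_synthesize)
```

Because each stage is only a callable with a fixed signature, any one of them can be replaced or upgraded without touching the other two, which is the decoupling the modular design is meant to provide.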
2. SenseVoice: Speech Understanding Models
SenseVoice-Small (234M parameters) is an encoder-only, non-autoregressive model based on memory-equipped self-attention (SAN-M):
- Input: 80-dimensional log-Mel frames, stacked and downsampled by a factor of 6.
- Task control: Special tokens prepended (e.g., ⟨LID⟩, ⟨SER⟩, ⟨AED⟩, ⟨ITN⟩).
- Output: Softmax head on encoder for fully parallel connectionist temporal classification (CTC) decoding.
- Performance: Latency ≲80 ms; CER/WER and SER/AED F1 scores at or above strong baselines.
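The input stacking and the parallel CTC decoding listed above can be illustrated with a small, dependency-free sketch. It makes two assumptions that the text does not confirm: that "stacked and downsampled ×6" means concatenating each group of 6 consecutive frames, and that decoding is greedy (per-frame argmax, merge repeats, drop blanks). The function names are hypothetical.

```python
from itertools import groupby


def stack_and_downsample(frames, factor=6):
    """Concatenate each run of `factor` consecutive frames into one vector.

    Trailing frames that do not fill a complete group are dropped here;
    a real front end may pad them instead.
    """
    usable = len(frames) - len(frames) % factor
    return [
        [value for frame in frames[i:i + factor] for value in frame]
        for i in range(0, usable, factor)
    ]


def ctc_greedy_decode(frame_scores, blank=0):
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # Each frame is scored independently, so this step is fully parallel.
    best = [max(range(len(s)), key=s.__getitem__) for s in frame_scores]
    collapsed = [token for token, _ in groupby(best)]
    return [token for token in collapsed if token != blank]


def one_hot(index, size=3):
    """Toy per-frame score vector with a single winning class."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# 12 frames of 80 features become 2 stacked frames of 480 features.
stacked = stack_and_downsample([[0.0] * 80 for _ in range(12)])

# Per-frame winners 1,1,0,1,2,2,0 decode to the label sequence 1,1,2.
decoded = ctc_greedy_decode([one_hot(i) for i in [1, 1, 0, 1, 2, 2, 0]])
```

The blank between the second and third frame is what lets the decoder keep two consecutive occurrences of label 1 instead of merging them.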
SenseVoice-Large (1.6B parameters) extends to an encoder–decoder Transformer with autoregressive decoding (beam search), supporting over 50 languages and fine-grained event timestamps.
Training: SenseVoice-Small is trained on 300k hours of audio covering 5 languages; SenseVoice-Large adds a further 100k hours of diverse speech spanning over 50 languages. Pseudo-labels supply the supervision for emotion and event detection (An et al., 2024).
ASR and SER/AED Results:
Selected benchmark results (CER and WER in %, lower is better; SER F1 in %, higher is better):
| Dataset | Whisper-S | Whisper-L | Sense-S | Sense-L |
|---|---|---|---|---|
| AISHELL-1 (CER, %) | 10.04 | 5.14 | 2.96 | 2.09 |
| LibriSpeech clean (WER, %) | 3.13 | 1.82 | 3.15 | 2.57 |
| CASIA (SER, F1 %) | 56.3 (SOTA) | -- | 70.3 | 95.5 |
This demonstrates SenseVoice's strong performance across tasks (An et al., 2024).
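For readers unfamiliar with the metrics, CER and WER are edit-distance rates: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. The sketch below shows the standard computation; the function names and the example sentences are illustrative and are not taken from the benchmarks.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference unit
                curr[j - 1] + 1,         # insertion of a hypothesis unit
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance as a percentage of the reference length."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(reference, hypothesis):
    """Word error rate: units are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: units are characters, spaces ignored."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))


# One substitution plus one deletion over six reference words.
example_wer = wer("the cat sat on the mat", "the cat sit on mat")
# One substituted character out of four.
example_cer = cer("你好世界", "你好视界")
```

CER is the usual choice for Mandarin test sets such as AISHELL-1, where word boundaries are not marked, while WER is standard for English sets such as LibriSpeech.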
3. CosyVoice: Speech Generation and Control
CosyVoice comprises a hierarchical generative model:
- (A) Semantic Speech Tokenizer: First 6 layers of SenseVoice-Large plus a single vector quantizer, yielding a 4096-entry discrete token vocabulary at 50 Hz for high phonetic/paralinguistic fidelity.
- (B) AR Token LLM (300M): Autoregressive model over tokens, optionally conditioned on speaker/style/timestamp tokens. Uses standard cross-entropy objectives.
- (C) Flow-Matching ODE Model: Maps generated tokens + speaker/stylistic conditioning into mel-spectrograms. Employs classifier-free guidance with conditional masking.
- (D) HiFTNet Vocoder: Converts mels to waveform, supporting streaming output.
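To make stage (C) more concrete, the sketch below integrates a flow ODE with Euler steps while applying classifier-free guidance. It rests on several assumptions: the guidance rule `(1 + w) * v_cond - w * v_uncond` is one common form and is not confirmed by the text as the exact rule used, and `toy_field` is a closed-form stand-in for the neural velocity network. All names are hypothetical.

```python
def guided_velocity(v_cond, v_uncond, scale):
    """Classifier-free guidance: push the conditional field away from
    the unconditional one by `scale` (assumed form, see lead-in)."""
    return [(1.0 + scale) * c - scale * u for c, u in zip(v_cond, v_uncond)]


def toy_field(x, t, target):
    """Velocity of the straight-line path that reaches `target` at t = 1.

    Stands in for a learned network evaluated with or without conditioning.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_sample(x0, cond_target, uncond_target, scale, steps=10):
    """Integrate dx/dt = guided velocity from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # always < 1, so the toy field never divides by zero
        v = guided_velocity(
            toy_field(x, t, cond_target),    # conditioning present
            toy_field(x, t, uncond_target),  # conditioning masked out
            scale,
        )
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# With scale 0 the sample lands on the conditional target.
plain = euler_sample([0.0, 0.0], [1.0, 2.0], [0.0, 0.0], scale=0.0)
# With scale 0.5 it overshoots away from the unconditional target.
guided = euler_sample([0.0, 0.0], [1.0, 2.0], [0.0, 0.0], scale=0.5)
```

In this toy setting `plain` ends at approximately `[1.0, 2.0]` and `guided` at approximately `[1.5, 3.0]`, which shows how a larger guidance scale strengthens the effect of the conditioning. In the real model the output would be a mel-spectrogram that the vocoder in stage (D) converts to a waveform.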
Capabilities:
- Multi-lingual, expressive, and controllable TTS.
- Zero-shot in-context voice cloning.
- Cross-lingual voice conversion by reusing the speaker embedding and mel-spectrogram extracted from the prompt audio.
- Fine-grained style/role/emotion controllability using instruction tuning and style tokens.
Zero-shot and Instructional Control:
- For zero-shot cloning: prepend the speech tokens of a 3–10 s prompt utterance and let the model synthesize the continuation in the same voice.
- The instruction-fine-tuned variant (CosyVoice-instruct) learns to respond to rich style prompts (e.g., "speak with an elegant whisper") (An et al., 2024).
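The prompt-and-continue idea behind zero-shot cloning can be sketched as follows. The 50 Hz token rate comes from the tokenizer description; everything else (the input layout, the `lm_step` callback, the end-of-sequence id) is a hypothetical arrangement chosen for illustration and should not be read as CosyVoice's actual interface.

```python
TOKEN_RATE_HZ = 50  # speech-token rate stated for the tokenizer


def prompt_token_count(seconds, rate_hz=TOKEN_RATE_HZ):
    """Number of speech tokens covering a prompt of the given duration."""
    return round(seconds * rate_hz)


def build_zero_shot_input(prompt_text_ids, target_text_ids, prompt_speech_tokens):
    """Lay out an in-context prompt: the text covers prompt and target,
    while the speech side contains only the prompt's tokens."""
    return {
        "text": list(prompt_text_ids) + list(target_text_ids),
        "speech_prefix": list(prompt_speech_tokens),
    }


def synthesize_continuation(lm_step, model_input, max_new_tokens, eos=-1):
    """Autoregressively extend the speech prefix until EOS or the budget."""
    history = list(model_input["speech_prefix"])
    generated = []
    for _ in range(max_new_tokens):
        token = lm_step(model_input["text"], history)
        if token == eos:
            break
        history.append(token)
        generated.append(token)
    return generated


def fake_lm_step(text_ids, history):
    """Stand-in model: emits 5 tokens after a 150-token prefix, then EOS."""
    return -1 if len(history) >= 155 else len(history)


# A 3 s prompt is 150 tokens; a 10 s prompt is 500 tokens.
short_prompt, long_prompt = prompt_token_count(3), prompt_token_count(10)

model_input = build_zero_shot_input([1, 2], [3, 4, 5], range(short_prompt))
new_tokens = synthesize_continuation(fake_lm_step, model_input, max_new_tokens=20)
```

Only the newly generated tokens are passed on for synthesis; the prefix exists solely to condition the model on the prompt speaker's voice.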
4. Training Pipelines and Open-Source Assets
Data: Hundreds of thousands of hours for ASR, TTS, and dialog alignment across dozens of languages and speakers; millions of pseudo-labels for emotion/event tags.
Fine-tuning & API Use:
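Example usage (an illustrative sketch; the package, class, and method names are schematic and may not match a released API; An et al., 2024):
```python
from funaudiollm import SenseVoiceForCTC, CosyVoiceForGeneration

model_asr = SenseVoiceForCTC.from_pretrained("FunAudioLLM/sensevoice-small")
# Placeholder checkpoint name: the source does not give one for CosyVoice.
model_cosy = CosyVoiceForGeneration.from_pretrained("<cosyvoice-checkpoint>")
# `audio` (input waveform) and `model_llm` (any text LLM exposing `generate`)
# are assumed to be supplied by the caller.

# 1. Speech understanding: transcript plus emotion tag.
text, emotion = model_asr.decode_with_emotion(audio)
# 2. LLM reasoning over the tagged transcript.
prompt = f"User: {text} (emotion: {emotion})"
response = model_llm.generate(prompt)
# 3. Speech generation, styled to match the detected emotion.
styled = f"<style={emotion}> {response}"
wave = model_cosy.generate_speech(styled)
```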
Assets:
- Code (GitHub): https://github.com/FunAudioLLM
- Demo Notebooks and Apps: https://fun-audio-llm.github.io
- ModelScope/Hugging Face releases: SenseVoice, CosyVoice, documentation and configs (An et al., 2024).
5. Quantitative Performance and Benchmarks
ASR & SER/AED:
- SenseVoice-Small achieves 2.96% CER on AISHELL-1 and 3.15% WER on LibriSpeech clean; on the CASIA SER benchmark, F1 is 70.3% for SenseVoice-Small and 95.5% for SenseVoice-Large.
- Latency: 70 ms (SenseVoice-Small) versus 1.6 s (SenseVoice-Large); the source reports a better speed-accuracy trade-off than Whisper Small/Large.
CosyVoice (LibriTTS & AISHELL-3 transcriptions):
| Model | WER (%) | Ins.+Del. (count) | Speaker similarity (%) |
|---|---|---|---|
| Human | 2.66 | 92 | 69.67 |
| CosyVoice | 2.89 | 88.6 | 74.30 |
| CosyVoice + rerank | 1.51 | 47 | 74.30 |
| ChatTTS (baseline) | 8.32 | 441 | — |
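The "CosyVoice + rerank" row suggests selecting among several synthesized candidates. The text does not describe the procedure, so the sketch below is an assumed reading: synthesize several takes, transcribe each with an ASR model, and keep the take whose transcript has the fewest word errors against the target text. The function names and the fake transcripts are hypothetical.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance between two strings."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def rerank_by_asr(target_text, candidates, transcribe):
    """Return the candidate whose ASR transcript is closest to the target."""
    return min(candidates, key=lambda c: word_errors(target_text, transcribe(c)))


# Stand-in for (audio, ASR transcript) pairs; keys play the role of audio.
fake_transcripts = {
    "take_1": "the quick brown fox fox jumps",  # one inserted word
    "take_2": "the quick brown fox jumps",      # exact match
    "take_3": "quick brown jumps",              # two deleted words
}

best = rerank_by_asr(
    "the quick brown fox jumps",
    list(fake_transcripts),
    fake_transcripts.get,
)
```

Selection of this kind mainly removes takes with repeated or skipped words, which is consistent with the large drop in the insertion and deletion column for the reranked row.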
Emotion Control (emo2vec classification accuracy):
- High accuracy (≥0.87) for happy, sad, fearful, and disgusted speech with CosyVoice-instruct.
Data augmentation for ASR:
- CosyVoice-synthesized data improves downstream ASR WER: for example, training on a mix of real and synthetic data (LibriSpeech+MLS) yields 2.04% WER on test_clean, versus 2.79% for real-only training (An et al., 2024).
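As a quick worked check of the size of this gain, the relative WER reduction follows directly from the two figures quoted above. The helper name is hypothetical.

```python
def relative_reduction(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


real_only_wer = 2.79  # % WER on test_clean, real data only
mixed_wer = 2.04      # % WER on test_clean, real + synthetic data

absolute_gain = real_only_wer - mixed_wer
relative_gain = relative_reduction(real_only_wer, mixed_wer)
```

The absolute gain is 0.75 percentage points, which corresponds to a relative WER reduction of roughly 27%.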
6. Applications and Extensibility
FunAudioLLM enables:
- Speech-to-speech translation: Recognize, translate, and re-synthesize sentences in the user's original voice using in-context cloning.
- Emotional voice chat: Emotionally-tagged recognition and synthesized responses with matching style.
- Audiobook narration: Analyzes character arcs and emotions, segments narration, and applies expressive style or specific voices.
- Interactive podcasting and dialog: Multi-turn conversations with controllable style, in any language.
- Augmented accessibility: Personalized or expressive content for visually impaired or language learners.
The pipeline is architecturally extensible to handle arbitrary languages, styles, or custom tasks by instruction-tuning, LoRA/adapter methods, or pseudo-labeling for new events or styles (An et al., 2024). All models and recipes are open-sourced, supporting reproducibility and further research.
7. Technical and Research Significance
FunAudioLLM advances modular, open-source speech–LLM integration by providing:
- Streaming, low-latency, multi-task ASR.
- Human-level, parameter-efficient, zero-shot voice cloning.
- Open research APIs and benchmark datasets.
- A general, pipeline-compatible approach that can be transplanted to dialog, translation, narration, teaching, accessibility, and creative content generation.
By integrating speech comprehension, paralinguistic and event understanding, LLM reasoning, and expressive, controllable synthesis, FunAudioLLM delineates a pathway towards truly intelligent, creative, and accessible voice interfaces (An et al., 2024).