Fun-Audio-Chat: Voice AI Conversation Model
- Fun-Audio-Chat is a large audio language model that powers real-time, high-fidelity voice conversations using dual-resolution speech representations.
- It employs innovative techniques such as token grouping, a Speech Refined Head, and multi-task DPO to reduce GPU cost by about 50% while boosting performance.
- The framework supports full-duplex interaction and open-source reproducibility, setting a new benchmark for voice-centric AI assistants.
Fun-Audio-Chat is a large audio LLM (LALM) framework for real-time, high-fidelity, and instruction-following voice-based AI conversations, achieving state-of-the-art efficiency and performance in speech-to-speech and speech-to-text tasks among similar-scale systems. It advances conversational agents by blending technical innovations in temporal resolution, catastrophic forgetting avoidance, task-aligned preference learning, computational efficiency, and full-duplex voice interaction (Chen et al., 23 Dec 2025).
1. Architectural Innovations: Dual-Resolution Speech Representations
A pivotal challenge in LALMs is the temporal resolution mismatch: semantic speech tokens arrive at 25 Hz, whereas text tokens typically operate at ≈3 Hz. Processing all tokens at 25 Hz is computationally expensive and degrades the LLM's semantic modeling capacity. Fun-Audio-Chat introduces Dual-Resolution Speech Representations (DRSR), which reorganize the temporal structure of the input:
- Token grouping: The speech token sequence at 25 Hz is grouped into non-overlapping windows of $G = 5$ tokens, producing input at 5 Hz. The grouped representation
$$\mathbf{g}_t = f_{\text{group}}\!\left(s_{Gt}, s_{Gt+1}, \dots, s_{Gt+G-1}\right)$$
enables alignment with text token rates for the Shared LLM backbone.
- Speech Refined Head (SRH): For high-fidelity audio generation, the SRH ungroups each LLM output state back to 25 Hz,
$$\left(\mathbf{h}_{t,1}, \dots, \mathbf{h}_{t,G}\right) = f_{\text{ungroup}}(\mathbf{h}_t),$$
and each sub-frame is fed to an autoregressive SRH that predicts audio tokens with the cross-entropy loss
$$\mathcal{L}_{\text{SRH}} = -\sum_{t}\sum_{k=1}^{G} \log p_{\theta}\!\left(a_{Gt+k} \mid a_{<Gt+k}, \mathbf{h}_{t,k}\right).$$
This dual-rate paradigm yields ≈50% reduction in GPU cost and improved modeling efficiency compared to single-resolution approaches (Chen et al., 23 Dec 2025).
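The grouping/ungrouping mechanics can be illustrated with a minimal PyTorch sketch. This is only an interpretation of DRSR under the assumptions above (group size 5, a linear projection for grouping, a linear expansion for ungrouping); layer names and dimensions are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class DualResolutionGrouping(nn.Module):
    """Minimal DRSR sketch: group 25 Hz speech embeddings to 5 Hz for the
    shared LLM, then ungroup LLM states back to 25 Hz for the Speech
    Refined Head (SRH). Shapes and layers are illustrative assumptions."""

    def __init__(self, d_model: int = 1024, group: int = 5):
        super().__init__()
        self.group = group
        # Grouping: concatenate G consecutive frames and project back to d_model (5 Hz).
        self.group_proj = nn.Linear(group * d_model, d_model)
        # Ungrouping: expand one 5 Hz LLM state into G sub-frame states (25 Hz).
        self.ungroup_proj = nn.Linear(d_model, group * d_model)

    def group_tokens(self, speech_emb: torch.Tensor) -> torch.Tensor:
        # speech_emb: (batch, T_25hz, d_model); T_25hz must be divisible by `group`.
        b, t, d = speech_emb.shape
        grouped = speech_emb.reshape(b, t // self.group, self.group * d)
        return self.group_proj(grouped)                      # (batch, T_5hz, d_model)

    def ungroup_states(self, llm_states: torch.Tensor) -> torch.Tensor:
        # llm_states: (batch, T_5hz, d_model) -> (batch, T_25hz, d_model)
        b, t, d = llm_states.shape
        return self.ungroup_proj(llm_states).reshape(b, t * self.group, d)
```

In this reading, the SRH consumes the ungrouped 25 Hz states and autoregressively predicts audio tokens, trained with the cross-entropy loss given above.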
2. Robust Training: Core-Cocktail and Multi-Task DPO
Addressing catastrophic forgetting and multi-task robustness, Fun-Audio-Chat employs a two-stage Core-Cocktail Training regimen:
- Stage 1: Rapid supervised adaptation using high-quality TTS-filtered speech/text data, with a cosine-annealed learning-rate schedule.
- Intermediate merging: A weighted merge of the rapid-adapted parameters $\theta_{\text{rapid}}$ and the pre-adapted parameters $\theta_{\text{pre}}$:
$$\theta_{\text{merged}} = \alpha\,\theta_{\text{rapid}} + (1 - \alpha)\,\theta_{\text{pre}}.$$
- Stage 2: Stable refinement on the same dataset, annealing the learning rate further.
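A weighted parameter merge of this kind is straightforward to express. The sketch below assumes a simple linear interpolation of two compatible state dicts with a scalar $\alpha$; the file names and the value of $\alpha$ are hypothetical, not the paper's exact recipe.

```python
import torch

def merge_checkpoints(theta_rapid: dict, theta_pre: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate two compatible state dicts:
    theta_merged = alpha * theta_rapid + (1 - alpha) * theta_pre.
    `alpha` is illustrative; the released recipe defines its own weighting."""
    merged = {}
    for name, w_rapid in theta_rapid.items():
        w_pre = theta_pre[name]
        merged[name] = alpha * w_rapid + (1.0 - alpha) * w_pre
    return merged

# Usage (hypothetical file names):
# rapid = torch.load("stage1_rapid.pt", map_location="cpu")
# pre = torch.load("pre_adapted.pt", map_location="cpu")
# torch.save(merge_checkpoints(rapid, pre, alpha=0.5), "merged.pt")
```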
After this, multi-task Direct Preference Optimization (DPO) is performed with a weighted combination of per-task preference losses:
$$\mathcal{L}_{\text{multi-DPO}} = \sum_{i} \lambda_i\, \mathcal{L}_{\text{DPO}}^{(i)},$$
where each $\mathcal{L}_{\text{DPO}}^{(i)}$ models preferences on: (1) robustness (noise/diversity), (2) instruction following (emotion/style/prosody control), (3) audio understanding, and (4) voice empathy. This preference-driven stage uses real-speech data and balances the attribute weights $\lambda_i$ to tune the model for practical conversational performance (Chen et al., 23 Dec 2025).
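The multi-task objective can be sketched as a weighted sum of standard DPO losses, one per attribute. The sketch assumes the canonical DPO formulation on sequence log-probabilities and illustrative per-task weights; the actual weighting and preference-data pipeline follow the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Standard DPO loss on per-example sequence log-probabilities."""
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(logits).mean()

def multi_task_dpo_loss(task_batches: dict, task_weights: dict, beta: float = 0.1):
    """Weighted sum over tasks (robustness, instruction following,
    audio understanding, voice empathy); weights are illustrative."""
    total = 0.0
    for task, batch in task_batches.items():
        # `batch` is a tuple of the four log-probability tensors above.
        total = total + task_weights[task] * dpo_loss(*batch, beta=beta)
    return total
```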
3. Speech/Text Generation, Model Configuration, and Computational Efficiency
Fun-Audio-Chat operates with parallel joint speech-text modeling: at each generative step $t$, the Shared LLM hidden state $\mathbf{h}_t$ drives a text head and a speech head in parallel,
$$y_t^{\text{text}} \sim p_{\text{text}}\!\left(\cdot \mid \mathbf{h}_t\right), \qquad y_t^{\text{speech}} \sim p_{\text{speech}}\!\left(\cdot \mid \mathbf{h}_t\right),$$
so that text and speech tokens are generated simultaneously from the same backbone (a minimal sketch appears after the system-flow list below).
There are two model sizes:
- 8B dense parameter version ("Fun-Audio-Chat-8B"),
- 30B MoE variant with 3B active parameters per step ("Fun-Audio-Chat-30B-A3B").
The system flow is:
- Audio input is encoded (Whisper-Large-v3 encoder + adapter).
- Grouped audio tokens (5 Hz) enter the LLM.
- LLM hidden states are used to simultaneously drive text and speech generation via dedicated output heads.
- Speech tokens pass through flow matching and HiFi-GAN modules for waveform synthesis.
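A schematic of the parallel text/speech heads over the shared backbone might look like the following. Module names, vocabulary sizes, and the greedy decoding step are placeholder assumptions; the Whisper front-end, SRH, flow matching, and HiFi-GAN stages are omitted.

```python
import torch
import torch.nn as nn

class JointSpeechTextHeads(nn.Module):
    """Sketch of one generative step: a shared LLM state drives a text head
    and a speech head in parallel; the speech tokens would then feed the
    SRH, flow-matching, and HiFi-GAN stages (not shown)."""

    def __init__(self, d_model: int = 1024, text_vocab: int = 152_000,
                 speech_vocab: int = 4_096):
        super().__init__()
        self.text_head = nn.Linear(d_model, text_vocab)
        self.speech_head = nn.Linear(d_model, speech_vocab)

    @torch.no_grad()
    def step(self, h_t: torch.Tensor):
        # h_t: (batch, d_model) hidden state of the shared LLM at 5 Hz.
        text_token = self.text_head(h_t).argmax(dim=-1)      # greedy for brevity
        speech_token = self.speech_head(h_t).argmax(dim=-1)
        return text_token, speech_token
```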
Reducing the backbone frame rate to 5 Hz requires 1.25×–5× less computation than models operating at 6.25–25 Hz. This yields per-chunk voice-assistant inference latencies under 100 ms and an ≈50% reduction in training GPU-hours relative to prior paradigms (Chen et al., 23 Dec 2025).
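The frame-rate arithmetic behind these savings is easy to verify: backbone cost grows at least linearly with the number of tokens per second of audio (and quadratically for attention over long contexts). A quick check under the linear-scaling assumption:

```python
# Tokens fed to the backbone per second of audio at different frame rates,
# relative to the 5 Hz Fun-Audio-Chat backbone (linear-cost assumption).
for rate_hz in (25.0, 12.5, 6.25, 5.0):
    print(f"{rate_hz:5.2f} Hz -> {rate_hz:5.2f} tokens/s, "
          f"{rate_hz / 5.0:.2f}x the 5 Hz backbone")
```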
4. Full-Duplex Conversational Capability
Fun-Audio-Chat-Duplex extends the base system to support simultaneous speech/text input streams, enabling full-duplex interaction:
- Parallel input streams: The model can accept user audio (speaking) concurrent with assistant speech generation.
- Training: Requires augmentation of half-duplex dialogues into concurrent stream simulations; training continues from the Core-Cocktail checkpoint.
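The augmentation of half-duplex dialogues into concurrent streams can be pictured as time-aligned interleaving of user and assistant token streams. The chunking scheme below (fixed-duration chunks with silence padding) is a hypothetical illustration, not the paper's exact recipe.

```python
from itertools import zip_longest

SILENCE = -1  # hypothetical padding token for "no speech in this chunk"

def to_duplex_stream(user_chunks, assistant_chunks):
    """Interleave user-input and assistant-output chunks on a shared timeline,
    padding whichever side is silent. Each element is a list of speech-token
    IDs covering one fixed-duration chunk."""
    merged = []
    for user, assistant in zip_longest(user_chunks, assistant_chunks,
                                       fillvalue=[SILENCE]):
        merged.append({"user": user, "assistant": assistant})
    return merged

# Example: the user keeps talking (barge-in) while the assistant is speaking.
stream = to_duplex_stream(
    user_chunks=[[12, 7], [3], [44, 8]],
    assistant_chunks=[[91, 90], [85]],
)
```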
Empirical evaluation on UltraEval-Audio yields the highest turn-taking rates (99.9–100%) and leading average S2M-T/S2M-S scores (54.9/49.3 for the 30B-A3B model), outperforming other contemporary systems (Chen et al., 23 Dec 2025).
5. Benchmark Results and Comparative Performance
Fun-Audio-Chat demonstrates superior or highly competitive performance relative to similar-scale models on standard spoken question answering, audio understanding, and voice empathy tasks:
| Model | In (Hz) | Out (Hz) | OpenAudioBench (%) | VoiceBench (%) | UltraEval-Audio (%) |
|---|---|---|---|---|---|
| GLM-4-Voice (9B) | 12.5 | 12.5+τ | 57.7 | 59.8 | 42.4 |
| MiniCPM-o 2.6 (7B) | 25 | τ | 62.6 | 71.7 | 48.1 |
| Kimi-Audio (7B) | 12.5 | 12.5 | 69.1 | 76.9 | 42.8 |
| MiMo-Audio (7B) | 6.25 | 6.25+τ | 65.5 | 74.1 | 55.5 |
| Fun-Audio-Chat-8B | 5 | 5 | 76.6 | 83.2 | 59.6 |
On Audio Understanding, Fun-Audio-Chat-30B-A3B achieves 77.9% (MMAU test) and 59.9% (MMAU-Pro), with semantic-based voice empathy 4.80/5 and paralinguistic empathy 3.55/5, surpassing comparable open and closed systems (Chen et al., 23 Dec 2025).
6. Open-Source Release and Reproducibility
Fun-Audio-Chat-8B and its training/inference code are open-sourced:
- Code repositories: https://github.com/FunAudioLLM/Fun-Audio-Chat
- HuggingFace: https://huggingface.co/FunAudioLLM/Fun-Audio-Chat-8B
- Interactive demo: https://funaudiollm.github.io/funaudiochat
The release includes the full training script pipeline (prealignment → Core-Cocktail → DPO → Duplex) and inference scripts for deployment (Chen et al., 23 Dec 2025).
7. Broader Context and Significance
By integrating dual-resolution token modeling, catastrophic forgetting mitigation, strong instruction following, voice empathy, and efficient full-duplex operation, Fun-Audio-Chat sets a computational and interaction benchmark for the next generation of voice-centric AI assistants. The design addresses longstanding challenges in resolution alignment, catastrophic forgetting, and resource consumption that have limited the scalability and responsiveness of prior LALMs. Its open-source character and reproducible implementation serve as a foundation for further development of real-time, expressive, voice-based conversational systems (Chen et al., 23 Dec 2025).