
Fun-Audio-Chat: Voice AI Conversation Model

Updated 30 December 2025
  • Fun-Audio-Chat is a large audio language model that powers real-time, high-fidelity voice conversations using dual-resolution speech representations.
  • It employs innovative techniques such as token grouping, a Speech Refined Head, and multi-task DPO to reduce GPU cost by about 50% while boosting performance.
  • The framework supports full-duplex interaction and open-source reproducibility, setting a new benchmark for voice-centric AI assistants.

Fun-Audio-Chat is a large audio LLM (LALM) framework for real-time, high-fidelity, instruction-following voice-based AI conversation, achieving state-of-the-art efficiency and performance on speech-to-speech and speech-to-text tasks among similar-scale systems. It advances conversational agents through innovations in temporal-resolution handling, catastrophic-forgetting mitigation, task-aligned preference learning, computational efficiency, and full-duplex voice interaction (Chen et al., 23 Dec 2025).

1. Architectural Innovations: Dual-Resolution Speech Representations

A pivotal challenge in LALMs is the temporal resolution mismatch: semantic speech tokens arrive at 25 Hz, whereas text tokens typically operate at ≈3 Hz. Handling all tokens at 25 Hz is computationally expensive and degrades the LLM's semantic modeling capacity. Fun-Audio-Chat introduces Dual-Resolution Speech Representations (DRSR), which reorganize the temporal structure of the input:

  • Token grouping: The speech token sequence $\mathbf{S} = [s_0, s_1, \dots, s_{T-1}]$ at $f_s = 25$ Hz is grouped into windows of $k = 5$, producing input at $f_b = 5$ Hz. The grouped representation:

$$\mathbf{g}_i = \mathrm{Linear}\left(\left[\mathbf{e}(s_{ik}),\dots,\mathbf{e}(s_{(i+1)k-1})\right]\right) \in \mathbb{R}^{d_\mathrm{text}}$$

enables alignment with text token rates for the Shared LLM backbone.

  • Speech Refined Head (SRH): For high-fidelity audio generation, SRH ungroups the LLM output back to 25 Hz:

$$\left[\mathbf{h}_{ug}^{(1)},\dots,\mathbf{h}_{ug}^{(k)}\right] = \mathrm{Split}_k\left(W_p\,\mathbf{h}_L^{(\mathrm{LLM})}\right)$$

and each sub-frame state is fed to the autoregressive SRH, which predicts audio tokens with loss:

$$\mathcal{L}_\mathrm{SRH} = -\sum_{i=1}^{T}\log P\left(s_i \mid s_{<i},\mathbf{H}_{<i}\right)$$

This dual-rate paradigm yields a ≈50% reduction in GPU cost and improved modeling efficiency compared to single-resolution approaches (Chen et al., 23 Dec 2025). A minimal sketch of the grouping and ungrouping operations follows.
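To make the two equations above concrete, here is a minimal PyTorch sketch of the dual-resolution adapter. The module names and dimensions (`d_speech`, `d_text`) are illustrative assumptions, not the released code; it only demonstrates how $k = 5$ consecutive 25 Hz embeddings are linearly projected into one 5 Hz backbone token and split back into $k$ sub-frame states for the SRH.

```python
import torch
import torch.nn as nn

class DualResolutionAdapter(nn.Module):
    """Sketch of DRSR token grouping/ungrouping (names are assumptions)."""

    def __init__(self, d_speech: int, d_text: int, k: int = 5):
        super().__init__()
        self.k = k
        # g_i = Linear([e(s_ik), ..., e(s_(i+1)k-1)])
        self.group_proj = nn.Linear(k * d_speech, d_text)
        # W_p, whose output is split into k sub-frame states for the SRH
        self.ungroup_proj = nn.Linear(d_text, k * d_speech)

    def group(self, speech_emb: torch.Tensor) -> torch.Tensor:
        """(B, T, d_speech) at 25 Hz -> (B, T//k, d_text) at 5 Hz."""
        B, T, d = speech_emb.shape
        assert T % self.k == 0, "pad the speech sequence to a multiple of k"
        windows = speech_emb.reshape(B, T // self.k, self.k * d)
        return self.group_proj(windows)

    def ungroup(self, llm_hidden: torch.Tensor) -> torch.Tensor:
        """(B, T//k, d_text) -> (B, T, d_speech): Split_k(W_p h)."""
        B, n, _ = llm_hidden.shape
        return self.ungroup_proj(llm_hidden).reshape(B, n * self.k, -1)

# Usage: 2 s of 25 Hz speech embeddings become 10 backbone tokens at 5 Hz.
adapter = DualResolutionAdapter(d_speech=512, d_text=4096, k=5)
emb = torch.randn(1, 50, 512)          # 50 frames = 2 s at 25 Hz
grouped = adapter.group(emb)           # (1, 10, 4096) for the shared LLM
subframes = adapter.ungroup(grouped)   # (1, 50, 512) back at 25 Hz
```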

2. Robust Training: Core-Cocktail and Multi-Task DPO

Addressing catastrophic forgetting and multi-task robustness, Fun-Audio-Chat employs a two-stage Core-Cocktail Training regimen:

  • Stage 1: Rapid supervised adaptation using high-quality TTS-filtered speech/text; learning rate cosine-annealed from $10^{-4}$ to $10^{-5}$.
  • Intermediate merging: A weighted merge of the rapidly adapted model $M_1$ and the pre-adapted model $M_0$ (see the sketch after this list):

$$M_r = \alpha\,M_1 + (1 - \alpha)\,M_0\,,\quad \alpha=0.5$$

  • Stage 2: Stable refinement on the same dataset, annealing the learning rate further.
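The intermediate merge is a plain parameter-space interpolation, sketched below under the assumption that both checkpoints share an identical architecture and parameter names; the function name and file paths are hypothetical.

```python
import torch

def merge_checkpoints(m0_state: dict, m1_state: dict, alpha: float = 0.5) -> dict:
    """Weighted parameter merge M_r = alpha * M_1 + (1 - alpha) * M_0."""
    merged = {}
    for name, w0 in m0_state.items():
        w1 = m1_state[name]
        merged[name] = alpha * w1 + (1.0 - alpha) * w0
    return merged

# Usage with hypothetical checkpoint files:
# m_r = merge_checkpoints(torch.load("m0.pt"), torch.load("m1.pt"), alpha=0.5)
```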

After this, multi-task Direct Preference Optimization (DPO) is performed:

$$\mathcal{L}_\mathrm{multiDPO} = \sum_{k=1}^{4} w_k\,\mathcal{L}_\mathrm{DPO}^{(k)}$$

where each $\mathcal{L}_\mathrm{DPO}^{(k)}$ models preferences on: (1) robustness (noise/diversity), (2) instruction-following (emotion/style/prosody control), (3) audio understanding, and (4) voice empathy. This preference-driven stage uses real-speech data and balances the attribute weights ($\sum_k w_k = 1$) to tune the model for practical conversational performance (Chen et al., 23 Dec 2025). A sketch of the weighted objective follows.
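The following PyTorch sketch combines standard per-task DPO losses with fixed weights. The task names, the weight values, and $\beta = 0.1$ are illustrative assumptions; only the weighted-sum structure comes from the paper's objective.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """Standard DPO loss from policy/reference log-probs of the chosen (w)
    and rejected (l) responses."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

def multi_task_dpo_loss(task_batches, weights):
    """L_multiDPO = sum_k w_k * L_DPO^(k) over the four preference tasks.

    `task_batches` maps task name -> tuple of log-prob tensors; the weights
    are assumed to sum to 1. A hedged sketch, not the released objective code.
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-6
    total = 0.0
    for task, batch in task_batches.items():
        total = total + weights[task] * dpo_loss(*batch)
    return total

# Usage with dummy log-probs (one shared batch per task for brevity):
b = (torch.randn(8), torch.randn(8), torch.randn(8), torch.randn(8))
loss = multi_task_dpo_loss(
    {"robustness": b, "instruction": b, "understanding": b, "empathy": b},
    {"robustness": 0.25, "instruction": 0.25, "understanding": 0.25, "empathy": 0.25},
)
```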

3. Speech/Text Generation, Model Configuration, and Computational Efficiency

Fun-Audio-Chat operates with parallel joint speech-text modeling. At each generative step $t$:

$$P(y_t \mid y_{<t}, x) = P(s_t, t_t \mid s_{<t}, t_{<t}, x)$$
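A minimal sketch of this factorization: one shared backbone hidden state drives two output heads in parallel at each step. The class name, vocabulary sizes, and greedy decoding are assumptions for illustration.

```python
import torch
import torch.nn as nn

class JointSpeechTextHeads(nn.Module):
    """Sketch of parallel joint modeling: a single backbone state h_t
    simultaneously parameterizes the text and speech token distributions."""

    def __init__(self, d_text: int, text_vocab: int, speech_vocab: int):
        super().__init__()
        self.text_head = nn.Linear(d_text, text_vocab)
        self.speech_head = nn.Linear(d_text, speech_vocab)

    def forward(self, h_t: torch.Tensor):
        # Both modalities are predicted from the same hidden state in parallel.
        return self.text_head(h_t), self.speech_head(h_t)

# Usage: greedy picks from both heads at one decoding step.
heads = JointSpeechTextHeads(d_text=4096, text_vocab=32000, speech_vocab=4096)
h = torch.randn(1, 4096)
text_logits, speech_logits = heads(h)
next_text, next_speech = text_logits.argmax(-1), speech_logits.argmax(-1)
```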

There are two model sizes:

  • 8B dense parameter version ("Fun-Audio-Chat-8B"),
  • 30B MoE variant with 3B active parameters per step ("Fun-Audio-Chat-30B-A3B").

The system flow (sketched in code after the list) is:

  1. Audio input is encoded (Whisper-Large-v3 encoder + adapter).
  2. Grouped audio tokens (5 Hz) enter the LLM.
  3. LLM hidden states are used to simultaneously drive text and speech generation via dedicated output heads.
  4. Speech tokens pass through flow matching and HiFi-GAN modules for waveform synthesis.
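The four steps can be summarized in a hedged Python sketch. Every name below (`encoder`, `adapter`, `llm.stream`, `heads`, `flow_matching`, `hifigan`) is a placeholder standing in for the corresponding module, not the released API.

```python
# Hedged end-to-end sketch of the inference flow described above.
def respond(waveform, encoder, adapter, llm, heads, flow_matching, hifigan):
    feats = encoder(waveform)            # 1. Whisper-Large-v3 encoder + adapter
    tokens_5hz = adapter.group(feats)    # 2. 25 Hz features -> 5 Hz backbone input
    text_out, speech_tokens = [], []
    for h_t in llm.stream(tokens_5hz):   # 3. autoregressive backbone steps
        t_tok, s_tok = heads(h_t)        #    parallel text + speech prediction
        text_out.append(t_tok)
        speech_tokens.append(s_tok)
    mel = flow_matching(speech_tokens)   # 4. speech tokens -> acoustic features
    audio = hifigan(mel)                 #    HiFi-GAN vocoder -> waveform
    return text_out, audio
```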

Reducing the backbone frame rate to 5 Hz requires 1.25×–5× less computation than contemporaneous models operating at 6.25–25 Hz (e.g., 25 Hz / 5 Hz = 5× fewer backbone tokens). This yields per-chunk voice-assistant inference latencies under 100 ms and a ≈50% reduction in training GPU-hours relative to prior paradigms (Chen et al., 23 Dec 2025).

4. Full-Duplex Conversational Capability

Fun-Audio-Chat-Duplex extends the base system to support simultaneous speech/text input streams, enabling full-duplex interaction:

  • Parallel input streams: The model can accept user audio (speaking) concurrent with assistant speech generation.
  • Training: Requires augmenting half-duplex dialogues into concurrent-stream simulations (a data-shaping sketch follows this list); training continues from the Core-Cocktail checkpoint.
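A minimal sketch of the half-duplex-to-duplex augmentation: each turn-based dialogue is unrolled into two time-aligned streams, with the non-speaking side padded by silence frames. The padding token and frame representation are illustrative assumptions.

```python
# Turn a half-duplex dialogue into two time-aligned streams for
# full-duplex training (sketch; PAD and frame encoding are assumptions).
PAD = "<silence>"

def to_duplex_streams(turns):
    """turns: list of (speaker, frames). Returns (user_stream, agent_stream)
    of equal length, padding whichever side is not speaking."""
    user, agent = [], []
    for speaker, frames in turns:
        if speaker == "user":
            user.extend(frames)
            agent.extend([PAD] * len(frames))
        else:
            agent.extend(frames)
            user.extend([PAD] * len(frames))
    return user, agent

# Usage: a two-turn exchange becomes parallel streams of equal length.
u, a = to_duplex_streams([("user", ["u1", "u2"]), ("agent", ["a1", "a2", "a3"])])
# u = ['u1', 'u2', '<silence>', '<silence>', '<silence>']
# a = ['<silence>', '<silence>', 'a1', 'a2', 'a3']
```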

Empirical evaluation on UltraEval-Audio yields the highest turn-taking rates (99.9–100%) and leading average S2M-T/S2M-S metrics (54.9/49.3 for the 30B-A3B model), outperforming other contemporary systems (Chen et al., 23 Dec 2025).

5. Benchmark Results and Comparative Performance

Fun-Audio-Chat demonstrates superior or highly competitive performance relative to similar-scale models on standard spoken question answering, audio understanding, and voice empathy tasks:

| Model | In (Hz) | Out (Hz) | OpenAudioBench (%) | VoiceBench (%) | UltraEval-Audio (%) |
|---|---|---|---|---|---|
| GLM-4-Voice (9B) | 12.5 | 12.5+τ | 57.7 | 59.8 | 42.4 |
| MiniCPM-o 2.6 (7B) | 25 | τ | 62.6 | 71.7 | 48.1 |
| Kimi-Audio (7B) | 12.5 | 12.5 | 69.1 | 76.9 | 42.8 |
| MiMo-Audio (7B) | 6.25 | 6.25+τ | 65.5 | 74.1 | 55.5 |
| Fun-Audio-Chat-8B | 5 | 5 | 76.6 | 83.2 | 59.6 |

On audio understanding, Fun-Audio-Chat-30B-A3B achieves 77.9% (MMAU test) and 59.9% (MMAU-Pro), with a semantic voice-empathy score of 4.80/5 and a paralinguistic empathy score of 3.55/5, surpassing comparable open and closed systems (Chen et al., 23 Dec 2025).

6. Open-Source Release and Reproducibility

Fun-Audio-Chat-8B and its training/inference code are open-sourced. The release includes the full training pipeline (prealignment → Core-Cocktail → DPO → Duplex) and inference scripts for deployment (Chen et al., 23 Dec 2025).

7. Broader Context and Significance

By integrating dual-resolution token modeling, catastrophic-forgetting mitigation, strong instruction-following, voice empathy, and efficient full-duplex operation, Fun-Audio-Chat sets a computational and interaction benchmark for the next generation of voice-centric AI assistants. The design addresses longstanding challenges in resolution alignment, catastrophic forgetting, and resource consumption that have limited the scalability and responsiveness of prior LALMs. Its open-source release and reproducible implementation serve as a foundation for further development of real-time, expressive, voice-based conversational systems (Chen et al., 23 Dec 2025).
