FunAudio-ASR: Hybrid ASR with LLM Integration

Updated 18 September 2025
  • The paper introduces a hybrid ASR system that leverages transformer-based audio encoders, LLM semantic reasoning, RAG, and RL to boost transcription accuracy and reduce latency.
  • It employs two-stage encoder pretraining and an adaptor module to translate acoustic signals into LLM-compatible inputs, mitigating LLM hallucinations and improving reliability.
  • Optimizations such as noise augmentation, code-switching training, and RL fine-tuning lead to competitive WERs and enhanced domain customization for real-world applications.

FunAudio-ASR is a large-scale automatic speech recognition (ASR) system architected for robust generalization and competitive accuracy in both academic benchmarks and production environments. Central to its design is a hybrid framework that synergizes transformer-based audio encoders, deep integration with LLMs, retrieval-augmented generation (RAG) strategies, and reinforcement learning (RL) protocols. Its development reflects recent trends in ASR: scaling data and model capacity, leveraging LLMs for semantic reasoning, and optimizing for deployment requirements such as streaming, noise robustness, multilinguality, and domain customization (An et al., 15 Sep 2025).

1. Architecture and System Components

FunAudio-ASR consists of four primary modules: an audio encoder, an audio adaptor, a Connectionist Temporal Classification (CTC) decoder, and an LLM-based decoder. The main architecture employs transformer encoder layers to extract deep representations from raw audio. The encoder is pretrained in two stages: unsupervised pretraining via Best-RQ, followed by supervised training in an attention-based encoder-decoder paradigm.

The audio adaptor serves as a bridge, transforming acoustic encoder outputs into representations compatible with the LLM. The CTC decoder generates initial hypotheses for transcription and hotword retrieval, providing a strong inductive prior that guides the downstream LLM-based decoder. The system ships in two primary parameterizations: a 0.7B audio encoder paired with a 7B LLM decoder for maximal accuracy, and a "nano" variant with a 0.2B encoder and 0.6B decoder for low-resource inference. Model selection depends on latency and deployment constraints.
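The four-module layout can be summarized in a few lines of PyTorch-style pseudocode. This is a minimal structural sketch: the class names, dimensions, layer counts, and the adaptor design below are illustrative assumptions, not the released FunAudio-ASR implementation.

```python
import torch.nn as nn

# Structural sketch of the four-module pipeline: encoder -> CTC head,
# encoder -> adaptor -> LLM decoder. Dimensions and depths are assumed.
class FunAudioASRSketch(nn.Module):
    def __init__(self, feat_dim=80, enc_dim=1024, llm_dim=4096, vocab_size=8000):
        super().__init__()
        self.frontend = nn.Linear(feat_dim, enc_dim)
        # Transformer audio encoder (stands in for the Best-RQ-pretrained stack).
        layer = nn.TransformerEncoderLayer(d_model=enc_dim, nhead=16, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=12)
        # CTC decoder: frame-level first-pass hypotheses for anchoring/hotwords.
        self.ctc_head = nn.Linear(enc_dim, vocab_size)
        # Adaptor: projects acoustic states into the LLM embedding space.
        self.adaptor = nn.Sequential(
            nn.Linear(enc_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, feats):                  # feats: (batch, frames, feat_dim)
        h = self.encoder(self.frontend(feats))
        ctc_logits = self.ctc_head(h)          # decoded greedily for a first pass
        llm_inputs = self.adaptor(h)           # consumed by the LLM-based decoder
        return ctc_logits, llm_inputs
```

In the full system the LLM decoder would attend over `llm_inputs` alongside its text context; it is elided here because its interface is not specified at this granularity.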

2. LLM Integration and Hallucination Mitigation

A defining innovation of FunAudio-ASR is the deep integration between the audio encoder and the LLM-based decoder, mediated by the adaptor module. Unlike classical ASR pipelines where the decoder relies solely on acoustic model outputs, FunAudio-ASR fuses semantic context from large pretrained text models to enhance final transcription accuracy.

To prevent LLM-driven hallucination—erroneous text generation not contained in the audio signal—the following strategies are implemented:

  • The CTC decoder produces preliminary transcription hypotheses that serve as anchors for the LLM decoder, constraining its output.
  • Reinforcement learning is employed with custom reward functions that penalize hallucinations and language mismatches; regex-based content checks verify that outputs do not invent non-existent speech segments.
  • Language consistency rewards and penalties ensure that output transcriptions align with input audio language.

These mechanisms collectively impart robustness against LLM-generated spurious text, a documented failure mode in production ASR use cases.
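As a concrete illustration of the anchoring idea, the following sketch builds the LLM prompt from the CTC first pass and scores potential hallucinations by counting output tokens unsupported by the anchor. The prompt wording, the token-overlap heuristic, and the function names are assumptions for illustration; the paper describes these safeguards at a higher level.

```python
import re

def build_prompt(ctc_hypothesis: str, hotwords: list[str]) -> str:
    """Condition the LLM decoder on the first-pass CTC transcript."""
    hot = ", ".join(hotwords) if hotwords else "none"
    return (
        "Refine the following draft transcript so it matches the audio.\n"
        f"Draft (CTC first pass): {ctc_hypothesis}\n"
        f"Relevant hotwords: {hot}\n"
        "Do not add words absent from the draft unless acoustically justified."
    )

def hallucination_score(ctc_hypothesis: str, llm_output: str) -> float:
    """Fraction of LLM output tokens with no support in the CTC anchor.

    A crude proxy: tokens never seen in the first pass count as potential
    inventions. A production system would use forced alignment instead.
    """
    anchor = set(re.findall(r"\w+", ctc_hypothesis.lower()))
    out = re.findall(r"\w+", llm_output.lower())
    if not out:
        return 0.0
    return sum(tok not in anchor for tok in out) / len(out)
```

A score near 1.0 indicates the decoder has drifted far from the acoustic evidence; in the RL stage such drift translates into a penalty.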

3. Real-World Optimization Strategies

FunAudio-ASR incorporates several optimizations crucial for industry deployment:

  • Streaming ASR: The model is trained with simulated chunked inputs, exposing only past context during training. This matches real-time inference conditions and supports low-latency applications (e.g., live captioning).
  • Noise Augmentation: 110K hours of low-noise speech are synthetically mixed with noise samples (average SNR 10 dB), and 30% of utterances are augmented online. Zero-padding before noise mixing exposes the model to pure-noise regions (see the mixing sketch below).
  • Code-Switching: The system robustly transcribes mixed Chinese-English speech. 40K English keywords are synthesized with Qwen3 and converted to speech, forming a targeted code-switching training corpus.
  • Hotword Customization: RAG enables domain-specific hotword retrieval. Edit distances between CTC outputs and hotword dictionaries are computed to retrieve candidates, which are routed to the LLM decoder for transcription refinement (see the retrieval sketch below).

These enhancements collectively address requirements such as domain adaptation, handling of background noise, and support for multilingual or mixed-language environments.
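The SNR-controlled mixing described in the noise-augmentation bullet can be made concrete. The sketch below mixes a noise clip into clean speech at a target SNR, zero-padding the speech so the model also sees noise-only regions; the padding length and sampling policy are assumed for illustration.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float = 10.0,
               pad_seconds: float = 0.5, sr: int = 16000) -> np.ndarray:
    """Mix noise into speech at the requested SNR (dB); 1-D float arrays."""
    # Zero-pad the speech so part of the utterance is pure noise.
    pad = np.zeros(int(pad_seconds * sr), dtype=speech.dtype)
    speech = np.concatenate([pad, speech, pad])
    # Tile or trim the noise to cover the padded speech.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    # Scale noise to hit the target SNR: SNR = 10 * log10(P_speech / P_noise).
    p_speech = np.mean(speech ** 2) + 1e-10
    p_noise = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Applying this online to roughly 30% of utterances, with SNRs sampled around 10 dB, reproduces the augmentation policy stated above.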
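The hotword retrieval step can likewise be sketched as a fuzzy match between the CTC first pass and a hotword dictionary; the length normalization and the 0.34 threshold below are assumed values, not figures from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def retrieve_hotwords(ctc_text: str, dictionary: list[str],
                      threshold: float = 0.34) -> list[str]:
    """Return hotwords whose closest span in the CTC output is within threshold."""
    tokens = ctc_text.split()
    hits = []
    for word in dictionary:
        n = max(1, len(word.split()))
        spans = [" ".join(tokens[i:i + n])
                 for i in range(len(tokens) - n + 1)] or [ctc_text]
        best = min(edit_distance(word.lower(), s.lower()) / max(len(word), 1)
                   for s in spans)
        if best <= threshold:
            hits.append(word)
    return hits  # retrieved candidates are routed to the LLM decoder
```

Retrieved candidates enter the LLM decoder as additional context, letting it spell rare domain terms correctly.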

4. Performance Evaluation

FunAudio-ASR demonstrates strong empirical performance across academic and production datasets:

  • Benchmarks: On AIShell-1, LibriSpeech, and FLEURS, FunAudio-ASR achieves competitive or superior word error rates (WERs) compared with recent ASR baselines (Paraformer-v2, Kimi-Audio, FireRedASR, Seed-ASR).
  • Streaming/Real-world Scenarios: In industry datasets covering canteens, meetings, and outdoor backgrounds, FunAudio-ASR reports streaming test average WERs as low as 6.66–7.00%, surpassing commercial ASR APIs.
  • Customization Metrics: Hotword recall and accuracy rates reach up to 0.97, with recall for specific domains (e.g., names) improving from 0.75 to 1.00 after customization.
  • Reinforcement Learning Improvements: RL-based fine-tuning with the FunRL framework yields 4–9% relative WER reductions, largely by reducing insertion and deletion errors.

The following table, adapted from (An et al., 15 Sep 2025), summarizes these metrics:

| Scenario       | WER (Streaming) | Recall (Hotword) |
|----------------|-----------------|------------------|
| Canteen        | 6.66%           | 0.97             |
| Meeting        | 7.00%           | 0.96             |
| Names (custom) | --              | 1.00             |

5. Reinforcement Learning: FunRL

FunAudio-ASR utilizes a custom RL framework, FunRL, tailored for joint optimization of audio and LLM components. Key properties:

  • Batch inference of audio embeddings on GPU, followed by LLM rollouts (using SGLang).
  • Reward computation includes ASR accuracy (a 1 − WER term), keyword accuracy/recall, noise robustness, hallucination suppression, and language consistency.
  • The policy loss is a clipped objective with KL regularization, structurally paralleling MWER optimization.
  • Ray-based orchestration enables efficient GPU resource alternation for RL updates.
  • The advantage function is computed as $\hat{A}_{i,t} = \left(R_i - \text{mean}(\{R_j\})\right) / \text{std}(\{R_j\})$.

This RL regimen is central to fine-tuning ASR performance, especially for streaming and domain-adaptive tasks.
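The group-normalized advantage and the clipped objective can be written compactly. In the sketch below, the composite-reward weights, the clipping range `eps`, and the KL coefficient `beta` are illustrative assumptions; only the advantage normalization follows the formula given above.

```python
import torch

def composite_reward(wer: float, kw_recall: float,
                     hallucinated: bool, lang_match: bool) -> float:
    """Blend the reward terms listed above; the weights are assumed."""
    r = (1.0 - wer) + 0.5 * kw_recall       # accuracy + keyword terms
    r -= 1.0 if hallucinated else 0.0       # hallucination suppression
    r += 0.2 if lang_match else -0.5        # language consistency
    return r

def group_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """A_hat_i = (R_i - mean({R_j})) / std({R_j}), shared across timesteps."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def clipped_policy_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                        advantages: torch.Tensor, kl: torch.Tensor,
                        eps: float = 0.2, beta: float = 0.01) -> torch.Tensor:
    """PPO-style clipped surrogate with KL regularization."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean() + beta * kl.mean()
```

Rewards for a group of sampled transcripts from the same utterance are normalized together, so better-than-average hypotheses receive positive advantages that push the policy toward them.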

6. Research Outlook and Future Development

Identified future directions for FunAudio-ASR include:

  • Language Expansion: Current coverage is limited to Chinese and English, plus the multilingual FunAudio-ASR-ML variant; support for additional languages is planned.
  • Long-context Audio Processing: Handling extended recordings remains challenging without a voice activity detection (VAD) module. There is scope for end-to-end architectures that natively manage long-context segmentation.
  • Far-field and Multi-channel Audio: Deployment is currently optimized for near-field recordings. Expansion to far-field and multi-array audio settings is prioritized for broader applicability.

A plausible implication is that the system’s modular and RL-driven design lends itself to rapid adaptation for new languages, acoustic environments, and domain-specific vocabularies.

7. Significance and Application Scope

FunAudio-ASR exemplifies the convergence of audio transformer modeling, LLM-guided semantic inference, reinforcement learning, and practical optimizations for modern ASR. It addresses persistent challenges in industry deployment (LLM hallucination, streaming latency, acoustic variability, code-switching, and hotword recall) via a cohesive end-to-end framework. The architecture is designed to be extensible toward broader multilingualism, improved memory for long audio, and more sophisticated audio scene understanding, and it demonstrates the state of the art in deploying LLM-based ASR robustly and efficiently in production scenarios (An et al., 15 Sep 2025).
