
FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models from Encoder-Decoder to LLM Integration (2501.14350v1)

Published 24 Jan 2025 in eess.AS and cs.SD

Abstract: We present FireRedASR, a family of large-scale automatic speech recognition (ASR) models for Mandarin, designed to meet diverse requirements in superior performance and optimal efficiency across various applications. FireRedASR comprises two variants: FireRedASR-LLM: Designed to achieve state-of-the-art (SOTA) performance and to enable seamless end-to-end speech interaction. It adopts an Encoder-Adapter-LLM framework leveraging LLM capabilities. On public Mandarin benchmarks, FireRedASR-LLM (8.3B parameters) achieves an average Character Error Rate (CER) of 3.05%, surpassing the latest SOTA of 3.33% with an 8.4% relative CER reduction (CERR). It demonstrates superior generalization capability over industrial-grade baselines, achieving 24%-40% CERR in multi-source Mandarin ASR scenarios such as video, live, and intelligent assistant. FireRedASR-AED: Designed to balance high performance and computational efficiency and to serve as an effective speech representation module in LLM-based speech models. It utilizes an Attention-based Encoder-Decoder (AED) architecture. On public Mandarin benchmarks, FireRedASR-AED (1.1B parameters) achieves an average CER of 3.18%, slightly worse than FireRedASR-LLM but still outperforming the latest SOTA model with over 12B parameters. It offers a more compact size, making it suitable for resource-constrained applications. Moreover, both models exhibit competitive results on Chinese dialects and English speech benchmarks and excel in singing lyrics recognition. To advance research in speech processing, we release our models and inference code at https://github.com/FireRedTeam/FireRedASR.

FireRedASR: Advancements in Mandarin Speech Recognition

The paper entitled "FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models" introduces two automatic speech recognition (ASR) models tailored specifically for Mandarin: FireRedASR-LLM and FireRedASR-AED. These models are designed to achieve superior performance and efficiency across varied speech recognition tasks, representing a significant stride in ASR technology.

Overview of Models

FireRedASR-LLM utilizes an Encoder-Adapter-LLM framework, leveraging the capabilities of a large language model (LLM). With 8.3 billion parameters, this model achieves a Character Error Rate (CER) of 3.05% on public Mandarin benchmarks, an 8.4% relative CER reduction (CERR) over the previous state-of-the-art. It excels in multi-source ASR scenarios such as video, live streaming, and intelligent assistants, providing a CERR of 24% to 40% over industrial-grade baselines.

FireRedASR-AED, on the other hand, is a smaller model with 1.1 billion parameters that employs an Attention-based Encoder-Decoder (AED) architecture. It balances performance and computational efficiency, achieving a CER of 3.18%, surpassing competitors with significantly larger model sizes. This model is notably compact, rendering it ideal for resource-constrained applications.
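The relative-improvement figures above follow directly from the reported CERs. A minimal sketch of the arithmetic (the helper name `cerr` is ours; the CER values are taken from the abstract):

```python
def cerr(baseline_cer: float, new_cer: float) -> float:
    """Relative Character Error Rate Reduction (CERR), in percent."""
    return 100.0 * (baseline_cer - new_cer) / baseline_cer

# FireRedASR-LLM (3.05% CER) vs. the previous SOTA (3.33% CER):
print(round(cerr(3.33, 3.05), 1))  # → 8.4
```

The same formula underlies the 24%-40% CERR figures quoted against industrial-grade baselines.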

Both models demonstrate notable versatility, performing well on Chinese dialects and English speech. They also excel at the niche task of singing lyrics recognition, achieving up to 67% CERR relative to industrial baselines.

Key Contributions and Implications

  1. Model Performance and Efficiency: The authors highlight the models' ability to achieve high accuracy with efficient use of parameters. FireRedASR models outperform the previous SOTA models while maintaining computational efficiency. This balance of performance and efficiency makes these models suitable for various industrial applications.
  2. Robustness in Real-World Scenarios: In testing across applications such as live streaming and intelligent assistants, the FireRedASR models delivered consistently strong accuracy, meeting the dependability demands of real-world ASR deployments.
  3. Versatile Capabilities: Beyond standard ASR, the models also adapt to different linguistic contexts, including dialects and English, and even singing lyrics, showcasing their broad applicability beyond singular language environments.
  4. Open-Source Contribution: By releasing both models' weights and inference code, the authors aim to fuel ongoing research in speech processing while enabling broad application in modern speech interaction systems.

Architectural Details and Methodologies

FireRedASR-AED pairs a Conformer-based encoder with a Transformer-based decoder, combining the Conformer's strength at modeling local acoustic patterns with the Transformer's sequence-generation capability. The model is trained on a substantial corpus of diverse, high-quality audio that was manually transcribed, which the authors credit for better results than training on weakly-labeled datasets.
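To make the encoder side concrete, here is a toy numpy sketch of one Conformer-style encoder block (macaron half-step feed-forward layers around self-attention; the convolution module, layer norms, and multi-head split are omitted for brevity). All dimensions and parameter names are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy model dimension; the real model is far larger

def attention(q, k, v):
    """Scaled dot-product attention: the core op shared by the
    Conformer encoder and the Transformer decoder."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def ffn(x, w1, w2):
    """ReLU feed-forward sublayer."""
    return np.maximum(x @ w1, 0.0) @ w2

def conformer_block(x, p):
    """Simplified macaron structure: half-step FFN -> self-attention
    -> half-step FFN, each with a residual connection."""
    x = x + 0.5 * ffn(x, p["ff1_a"], p["ff1_b"])
    x = x + attention(x, x, x)   # self-attention over audio frames
    x = x + 0.5 * ffn(x, p["ff2_a"], p["ff2_b"])
    return x

params = {k: rng.normal(scale=0.1, size=(d, d))
          for k in ["ff1_a", "ff1_b", "ff2_a", "ff2_b"]}
frames = rng.normal(size=(20, d))  # 20 acoustic feature frames
enc_out = conformer_block(frames, params)
print(enc_out.shape)  # (20, 8): one hidden vector per frame
```

In the full model, stacks of such blocks produce the speech representations that either the AED decoder or, in FireRedASR-LLM, the adapter consumes.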

FireRedASR-LLM builds on the same encoder by incorporating an LLM through a purpose-built adapter network. The adapter projects the encoder's output into the LLM's input embedding space, allowing the LLM to process the semantic content of the speech. This design exploits pre-trained LLMs while using Low-Rank Adaptation (LoRA) to fine-tune them efficiently.
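The two ideas in this framework, an adapter that maps encoder frames into the LLM's embedding space and LoRA updates on frozen LLM weights, can be sketched in a few lines. This is our simplification with toy dimensions, not the paper's exact adapter design:

```python
import numpy as np

rng = np.random.default_rng(0)
d_enc, d_llm, r = 8, 16, 2  # toy sizes; real dimensions are much larger

def adapter(enc_out, w_proj, stride=2):
    """Subsample frames and project them into the LLM embedding space,
    yielding fewer, LLM-sized 'speech tokens'."""
    return enc_out[::stride] @ w_proj

def lora_linear(x, w_frozen, a, b, alpha=1.0):
    """LoRA: augment the frozen weight W with a low-rank update B @ A;
    only the small factors a and b are trained."""
    return x @ w_frozen + alpha * (x @ a) @ b

enc_out = rng.normal(size=(20, d_enc))           # 20 encoder frames
w_proj = rng.normal(scale=0.1, size=(d_enc, d_llm))
speech_embeds = adapter(enc_out, w_proj)
print(speech_embeds.shape)  # (10, 16): subsampled, LLM-dimensional

w = rng.normal(scale=0.1, size=(d_llm, d_llm))   # frozen LLM weight
a = rng.normal(scale=0.1, size=(d_llm, r))       # trainable LoRA factor
b = np.zeros((r, d_llm))                         # B initialized to zero
y = lora_linear(speech_embeds, w, a, b)
# With B = 0 the LoRA path is inactive, so the layer behaves exactly
# like the frozen original at the start of training.
print(np.allclose(y, speech_embeds @ w))  # True
```

Initializing B to zero is the standard LoRA choice: fine-tuning starts from the pre-trained model's behavior and only gradually departs from it.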

Future Developments

Promising directions from this research include further performance gains through more comprehensive training datasets or more efficient LLM-integration strategies. Expanding language support beyond Mandarin and its dialectal variants remains an avenue worth exploring, and continued study of scaling laws in model training may yield more efficient methodologies for larger models.

In summary, the FireRedASR paper contributes significantly to the ASR community by enhancing Mandarin speech recognition's scope, precision, and efficiency. The models it presents not only advance current performance levels but also establish a robust framework that can be built upon for broader applications and further improvements in the field of ASR.

Authors
  1. Kai-Tuo Xu
  2. Feng-Long Xie
  3. Xu Tang
  4. Yao Hu