Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition (2407.04675v2)

Published 5 Jul 2024 in eess.AS and cs.SD

Abstract: Modern automatic speech recognition (ASR) models are required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc.) given specific contextual information in various application scenarios. Classic end-to-end models fused with extra LLMs perform well, but mainly in data-matched scenarios, and are gradually approaching a bottleneck. In this work, we introduce Seed-ASR, an LLM-based speech recognition model. Seed-ASR is developed on the framework of the audio-conditioned LLM (AcLLM), leveraging the capabilities of LLMs by inputting continuous speech representations together with contextual information into the LLM. Through stage-wise large-scale training and the elicitation of context-aware capabilities in the LLM, Seed-ASR demonstrates significant improvement over end-to-end models on comprehensive evaluation sets covering multiple domains, accents/dialects, and languages. Additionally, Seed-ASR can be further deployed to support specific needs in various scenarios without requiring extra LLMs. Compared to recently released large ASR models, Seed-ASR achieves 10%-40% reductions in word (or character, for Chinese) error rates on Chinese and English public test sets, further demonstrating its powerful performance.

Overview of "Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition"

"Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition" presents an innovative approach to automatic speech recognition (ASR) by leveraging LLMs. Built upon the framework of audio-conditioned LLMs (AcLLM), Seed-ASR aims to improve recognition accuracy across multiple domains, languages, accents, and dialects without relying on additional LLMs.

Core Features of Seed-ASR

The Seed-ASR model boasts a collection of features that distinguish it from traditional end-to-end models:

  1. High Recognition Accuracy: Seed-ASR is trained on an extensive dataset comprising over 20 million hours of speech. The resulting model shows significant improvements in transcription accuracy, with a reported 10%–40% reduction in word error rate (WER) on Chinese and English public test sets compared to current state-of-the-art models.
  2. Large Model Capacity: The model utilizes an audio encoder with nearly 2 billion parameters and a Mixture of Experts (MoE) LLM comprising tens of billions of parameters (a minimal MoE routing sketch follows this list). This architecture allows the model to benefit from scaling laws, which is corroborated by empirical evidence in the paper.
  3. Multiple Language Support: Seed-ASR (CN) supports Mandarin and 13 Chinese dialects, while Seed-ASR (ML) extends to English and seven additional languages, with ongoing work to support up to 40 languages.
  4. Context-aware Ability: The model integrates contextual information, such as historical dialogues and meeting details, to augment its recognition performance in various scenarios. This integration is particularly useful in boosting keyword recall rates.
  5. Stage-wise Training: The training process is segmented into self-supervised learning (SSL), supervised fine-tuning (SFT), context-aware SFT, and reinforcement learning (RL). Each stage focuses on progressively enhancing different aspects of ASR performance.
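
As referenced in item 2, the following is a minimal sketch of a top-k mixture-of-experts feed-forward layer, the building block that lets MoE LLMs reach tens of billions of parameters while activating only a few experts per token. The routing scheme, expert count, and dimensions below are generic placeholders; the paper does not specify Seed-ASR's MoE configuration.

```python
# Illustrative top-2 MoE feed-forward layer (configuration is assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # per-token gating scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        gates = self.router(x)                 # (tokens, n_experts)
        weights, idx = gates.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):             # route each token to its k experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

print(TopKMoE()(torch.randn(4, 512)).shape)    # torch.Size([4, 512])
```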

Methodology and Architecture

SSL of Audio Encoder:

The audio encoder, LUISE, is a conformer-based model trained with large-scale self-supervised learning to extract rich representations from unlabeled speech. This phase trains on tens of millions of hours of speech, ensuring that the encoder can handle a broad spectrum of speech variation.
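
One common recipe in this family of speech SSL is BERT-style masked prediction: mask a fraction of input frames, run the conformer encoder, and predict discrete targets at the masked positions. The paper does not spell out LUISE's exact objective, so the sketch below, including the discrete target scheme, is an illustrative assumption.

```python
# Hedged sketch of masked-prediction speech SSL (HuBERT/BEST-RQ family).
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_ssl_loss(encoder, frames, targets, mask_prob=0.3):
    # frames:  (batch, time, feat_dim) input features
    # targets: (batch, time) discrete labels, e.g. from k-means or a
    #          random-projection quantizer (assumed here, not from the paper)
    mask = torch.rand(frames.shape[:2]) < mask_prob       # frames to mask
    masked = frames.masked_fill(mask.unsqueeze(-1), 0.0)  # zero out masked frames
    logits = encoder(masked)                              # (batch, time, n_codes)
    # Cross-entropy only at masked positions, as in BERT-style pretraining.
    return F.cross_entropy(logits[mask], targets[mask])

# Toy stand-in encoder: a single linear layer in place of a deep conformer stack.
toy = nn.Linear(80, 1024)
loss = masked_ssl_loss(toy, torch.randn(2, 50, 80), torch.randint(0, 1024, (2, 50)))
print(loss.item())
```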

SFT:

After SSL, the model undergoes supervised fine-tuning, in which encoded speech representations are mapped to their textual transcriptions. This stage leverages the semantic knowledge and reasoning capabilities already present in the LLM.
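
A standard way to implement this mapping is next-token cross-entropy on the transcript tokens only, with the audio, context, and instruction positions excluded from the loss. The sketch below shows label construction under that assumption, using PyTorch's -100 ignore-index convention; token ids and layout are illustrative.

```python
# Sketch of SFT label masking: loss applies only to transcript tokens.
import torch

def sft_labels(n_prefix, transcript_ids, pad_id=0):
    # n_prefix: number of context + audio + instruction positions that should
    #           not contribute to the loss; transcript_ids: (batch, n_text).
    batch = transcript_ids.size(0)
    ignore = torch.full((batch, n_prefix), -100)           # -100 => ignored by CE
    labels = transcript_ids.masked_fill(transcript_ids == pad_id, -100)
    return torch.cat([ignore, labels], dim=1)

# Toy example: 5 prefix positions (audio/context), 4 transcript tokens.
labels = sft_labels(5, torch.tensor([[11, 12, 13, 0]]))
print(labels)  # tensor([[-100, -100, -100, -100, -100, 11, 12, 13, -100]])
```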

Context SFT:

This stage enhances the model's ability to utilize contextual information effectively. The model is trained on <context, speech, text> triples, allowing it to infer speech content that is strongly dependent on preceding dialogue or specific scenarios.
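
A hedged sketch of turning one such triple into a training example follows; the paper does not give its prompt template, so the wording and field names here are assumptions for illustration.

```python
# Illustrative construction of a context-SFT example from a
# <context, speech, text> triple (template wording is assumed).
def build_context_example(context: str, transcript: str) -> dict:
    # The audio embeddings are spliced in by the model after the instruction;
    # only the text portions of the example are shown here.
    prompt = f"Context: {context}\nTranscribe the following speech:"
    return {"prompt": prompt, "target": transcript}

example = build_context_example(
    context="Meeting about the Q3 roadmap; attendees: Li Wei, Zhang Min.",
    transcript="Li Wei will present the Q3 roadmap next Monday.",
)
print(example["prompt"])
```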

Reinforcement Learning:

In the final stage, RL is employed to align the model's behavior with ASR evaluation metrics such as WER. This stage incorporates MWER (minimum word error rate) training and a weighted WER criterion to fine-tune the model further, emphasizing the accuracy of key semantic elements.
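
The standard MWER objective renormalizes model scores over an N-best list of sampled hypotheses and minimizes the expected, baseline-subtracted number of word errors, as sketched below. Seed-ASR's weighted-WER variant would additionally upweight errors on key semantic tokens; that weighting scheme is not detailed here, so the sketch shows only the plain MWER form.

```python
# Sketch of the standard MWER training loss over an N-best list.
import torch

def mwer_loss(log_probs, word_errors):
    # log_probs:   (n_best,) model log-probabilities of each sampled hypothesis
    # word_errors: (n_best,) edit-distance word errors vs. the reference
    p = torch.softmax(log_probs, dim=0)        # renormalize over the N-best list
    errs = word_errors.float()
    baseline = errs.mean()                     # variance-reducing baseline
    return (p * (errs - baseline)).sum()       # expected relative word errors

loss = mwer_loss(torch.tensor([-1.2, -2.3, -3.0]), torch.tensor([1, 3, 4]))
print(loss.item())
```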

Evaluation and Results

Seed-ASR (CN) and Seed-ASR (ML) were evaluated on a comprehensive set of benchmarks:

  1. Public Datasets: The models achieved state-of-the-art performance on several public Chinese and English ASR benchmarks, including AISHELL-1, AISHELL-2, and LibriSpeech, showing WER reductions of up to 40% compared to existing strong models such as Whisper and USM (a reference WER implementation is sketched after this list).
  2. Multi-domain and Hardcase Sets: The models demonstrated superior performance in diverse settings, confirmed by significant WER reductions in multi-domain evaluation sets and improved keyword F1 scores in hardcase sets.
  3. Dialects and Accents: Seed-ASR (CN) outperformed baselines on dialect and multi-accent sets, reducing WER by over 21% on single-dialect test sets.
  4. Contextual Awareness: By leveraging contextual information, the model demonstrated a substantial increase in keyword recall rates in dialogue evaluations, outstripping context FST biasing techniques used in strong end-to-end models.
  5. Subjective Intelligibility: In subjective intelligibility evaluations, Seed-ASR (CN) often surpassed human transcribers, especially in professional and complex audio environments.
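
For reference (see item 1), the WER metric behind these comparisons is the word-level Levenshtein distance divided by the reference length; for Chinese, the same computation over characters gives the character error rate (CER). A self-contained implementation:

```python
# Word error rate: edit distance over words / number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                       # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                       # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2/6 ≈ 0.333
```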

Implications and Future Directions

Seed-ASR sets a new benchmark in ASR by effectively utilizing LLMs, and its stage-wise training approach offers a robust method for developing high-performance speech recognition systems. The implications are wide-ranging, opening doors to more flexible and context-aware ASR applications in various sectors such as intelligent assistants, video captioning, and multilingual communications.

Future work will likely focus on:

  • Expanding Seed-ASR's multilingual support to include more languages.
  • Enhancing its long-form speech recognition capabilities.
  • Exploring multi-task learning within a single model framework.

The results affirm that Seed-ASR achieves considerable advancements in ASR technology, showcasing its robustness, adaptability, and applicability across a broad array of speech recognition tasks.
