Seed-ASR: LLM Speech Recognition Paradigm
- Seed-ASR is a cutting-edge, context-aware speech recognition framework that integrates audio-conditioned LLMs with seed-based protocols for rapid adaptation across domains and languages.
- It employs a multi-phase training regimen (self-supervised learning of the audio encoder, supervised fine-tuning, context SFT, and reinforcement learning) to achieve substantial reductions in error rates.
- The system supports fast domain adaptation using minimal data and contextual prompts, making it effective for both high-resource and low-resource language applications.
Seed-ASR encompasses both a cutting-edge LLM-driven speech recognition architecture and a family of methodology protocols centered on the notion of foundational or “seed” ASR systems for rapid adaptation to new domains, languages, or data regimes. While the most recent advances employ audio-conditioned LLM frameworks for context-rich ASR, the Seed-ASR concept also includes pipelines for data creation in low-resource languages and domain context modeling. Below, primary methodologies, architectures, evaluation metrics, representative results, and deployment considerations are synthesized from recent research.
1. Audio-Conditioned LLM Architecture of Seed-ASR
Seed-ASR introduces an “audio-conditioned LLM” (AcLLM) paradigm that integrates contextual information and audio embeddings into a unified generative framework (Bai et al., 2024). The architecture consists of the following modules:
- Prompt and Context Processing: The input sequence to the LLM begins with an instruction (e.g., “Transcribe the speech into text:”) and is optionally followed by contextual tokens (e.g., dialogue history, domain metadata).
- Audio Encoder (“LUISE”):
  - 32-layer Conformer (∼2 billion parameters) operating on 80-dimensional mel-filterbank features (25 ms window, 10 ms hop, i.e., a 100 Hz frame rate).
  - Trained with self-supervised masked-frame classification, in which a discrete tokenizer assigns target codes every 40 ms; downstream, LUISE provides continuous embeddings (the discrete codes act as SSL targets).
- Converter and Projection: The encoder's continuous output embeddings are spliced (four consecutive frames concatenated) and linearly projected into the LLM hidden space, yielding one LLM-space embedding per 160 ms of audio (4 × 40 ms).
- Autoregressive Decoder (LLM): A Mixture-of-Experts, decoder-only LLM generates the transcript conditioned causally on the [prompt | context | audio] inputs.
All contextual information is prepended as text tokens, and the transcript $\mathbf{y} = (y_1, \dots, y_T)$ is generated autoregressively:

$$P(\mathbf{y} \mid \text{prompt}, \text{context}, \text{audio}) = \prod_{t=1}^{T} P\big(y_t \mid y_{<t}, \text{prompt}, \text{context}, \text{audio}\big).$$

Unlike traditional hybrid systems, no external language model is connected at inference; the LLM's internal knowledge is exploited directly (Bai et al., 2024).
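The converter's frame splicing and projection, together with the assembly of the [prompt | context | audio] sequence, can be sketched as follows. This is a minimal PyTorch illustration under assumed dimensions (the class names and hidden sizes are illustrative, not taken from Bai et al., 2024):

```python
import torch
import torch.nn as nn

class Converter(nn.Module):
    """Splice consecutive encoder frames and project them into the LLM
    embedding space (sketch; enc_dim/llm_dim are assumed values)."""
    def __init__(self, enc_dim: int = 1024, llm_dim: int = 4096, splice: int = 4):
        super().__init__()
        self.splice = splice
        self.proj = nn.Linear(enc_dim * splice, llm_dim)

    def forward(self, enc_out: torch.Tensor) -> torch.Tensor:
        # enc_out: (batch, T, enc_dim) at a 40 ms stride
        b, t, d = enc_out.shape
        t = t - t % self.splice  # drop a ragged tail so T divides evenly
        spliced = enc_out[:, :t].reshape(b, t // self.splice, d * self.splice)
        return self.proj(spliced)  # (batch, T/4, llm_dim), i.e., a 160 ms stride

def build_llm_inputs(prompt_emb, context_emb, audio_emb):
    """Concatenate [prompt | context | audio] embeddings along the time axis;
    the decoder-only LLM then generates the transcript autoregressively."""
    return torch.cat([prompt_emb, context_emb, audio_emb], dim=1)
```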
2. Stage-wise Pretraining and Context Elicitation
The Seed-ASR training regime consists of four distinct phases, each targeting a specific capability:
- Self-Supervised Learning (SSL) of Audio Encoder (LUISE): Trained on massive unlabelled speech corpora (7.7–12.4 million hours) with masked-frame classification; the LUISE encoder is then used as a feature extractor.
- Supervised Fine-Tuning (SFT): LUISE and the converter are trained (LLM frozen) on hundreds of thousands of hours of paired speech-text data across multiple domains and dialects. Loss: cross-entropy on next-token prediction.
- Context SFT: The model is further trained on ⟨context, speech, transcript⟩ triples to elicit natural-language context integration, using the same cross-entropy loss conditioned on both audio and contextual inputs. Joint beam search with acoustic pruning balances contextual bias against direct acoustic evidence.
- Reinforcement Learning (RL): Initialized from the Context SFT model, RL minimizes the minimum/weighted WER across N-best transcription hypotheses, optionally upweighting critical search keywords. The final loss is a convex combination of the expected-WER objective and the cross-entropy term:

$$\mathcal{L} = \lambda\,\mathcal{L}_{\mathrm{MWER}} + (1-\lambda)\,\mathcal{L}_{\mathrm{CE}},$$

where $\mathcal{L}_{\mathrm{MWER}}$ is the expected word error rate over the N-best lists.
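A minimal sketch of the expected-WER term over an N-best list is given below, assuming hypothesis scores are renormalized with a softmax over the list and a mean-WER baseline is subtracted for variance reduction, both standard in MWER training; the paper's exact reward shaping (e.g., keyword upweighting) may differ:

```python
import torch

def expected_wer_loss(logprobs: torch.Tensor, wers: torch.Tensor) -> torch.Tensor:
    """Expected WER over an N-best list.

    logprobs: (N,) sequence log-probabilities of the N-best hypotheses
    wers:     (N,) word error rate of each hypothesis vs. the reference
    """
    probs = torch.softmax(logprobs, dim=0)          # renormalize over the list
    return torch.sum(probs * (wers - wers.mean()))  # mean baseline cuts variance

def rl_stage_loss(logprobs, wers, ce_loss, lam: float = 0.5):
    """Convex combination of expected WER and cross-entropy, as in the text."""
    return lam * expected_wer_loss(logprobs, wers) + (1.0 - lam) * ce_loss
```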
3. Contextualization and Domain Adaptation
Seed-ASR demonstrates robust domain adaptation and context tracking via its AcLLM design (Bai et al., 2024):
- Contextual inputs are unified as textual tokens with no need for specialized cross-attention layers.
- The system natively handles arbitrary, potentially long-form contextual histories (dialogue, meeting participants, domain tags, etc.).
- Adaptation to new scenarios requires only tens to hundreds of domain-representative triples and one epoch of context SFT.
- Prompt engineering and real-time instruction reweighting (e.g., inclusion of user IDs, keywords) can bias output style and recall.
In production, a joint beam-search hyperparameter mediates the trade-off between direct acoustic evidence and contextual bias (see the sketch below).
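One plausible realization of this trade-off is a per-step interpolation of context-conditioned and context-free next-token scores, with low-acoustic-score hypotheses pruned to limit context-induced hallucination. The sketch below is illustrative; the weight `alpha` and the fusion rule are assumptions, not the paper's exact joint-decoding algorithm:

```python
import numpy as np

def joint_step_scores(logp_with_ctx: np.ndarray,
                      logp_no_ctx: np.ndarray,
                      alpha: float = 0.7) -> np.ndarray:
    """Fuse next-token log-probabilities from a context-conditioned pass and a
    context-free pass; alpha > 0.5 favors contextual evidence (assumed rule)."""
    return alpha * logp_with_ctx + (1.0 - alpha) * logp_no_ctx

def acoustic_prune(hyps: list, margin: float = 5.0) -> list:
    """Drop beam hypotheses whose context-free (acoustic) score trails the
    best hypothesis by more than `margin` log-prob units."""
    best = max(h["logp_no_ctx"] for h in hyps)
    return [h for h in hyps if h["logp_no_ctx"] >= best - margin]
```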
4. Evaluation Metrics and Empirical Results
Seed-ASR achieves state-of-the-art performance across a wide spectrum of public and internal benchmarks, including multi-domain, accent/dialect, context, and code-switching sets (Bai et al., 2024):
Table 1. Chinese Public Benchmarks, CER (%)
| Model | AISHELL-1 | AISHELL-2 avg | WenetSpeech (test_net / test_meeting) | 6-set avg |
|---|---|---|---|---|
| Paraformer-large | 1.68 | 3.01 | 6.74 / 6.97 | 4.07 |
| Qwen-Audio | 1.30 | 3.23 | 9.50 / 10.87 | 5.23 |
| Hubert+Baichuan2 | 0.95 | 3.50 | 6.06 / 6.26 | 3.96 |
| Seed-ASR (CN) | 0.68 | 2.27 | 4.66 / 5.69 | 2.98 |
Seed-ASR delivers a 24–40% relative reduction in CER on public Chinese test sets compared to previous large ASR/LLM systems.
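The quoted relative reductions follow directly from Table 1; for instance, on the 6-set average against the strongest prior system:

```python
# Relative CER reduction of Seed-ASR (CN) vs. Hubert+Baichuan2 (Table 1, 6-set avg).
best_prior, seed_asr = 3.96, 2.98
print(f"{(best_prior - seed_asr) / best_prior:.1%}")  # 24.7%, the low end of 24-40%
```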
Table 2. Internal Chinese Benchmarks
| Model | Multi-domain WER (%) | Video (7-set avg) WER (%) | Hardcase F1 (%) |
|---|---|---|---|
| Transducer-E2E | 3.68 | 3.92 | 90.42 |
| Paraformer-large | 5.23 | 5.97 | 87.99 |
| Seed-ASR (CN) | 1.94 | 2.70 | 93.72 |
Remarkable recall gains (>15% absolute over baselines) are observed in dialogue context scenarios with joint beam search and context SFT.
Table 3. English & Multilingual Benchmarks (WER and F1 in %)
| System | English multi-domain WER | English accents WER | English hardcase F1 | Multi-domain (8 languages) WER |
|---|---|---|---|---|
| Google USM | 9.33 | 22.19 | 63.30 | 21.51 |
| Whisper Large v3 | 10.41 | 21.52 | 79.54 | 20.55 |
| Universal-1 | 9.95 | 14.40 | 77.82 | – |
| Seed-ASR (ML) | 5.34 | 11.26 | 87.94 | 12.16 |
Seed-ASR ML (multilingual model) achieves 10–40% WER reduction on public English/multilingual sets relative to the strongest baselines.
5. Extensions: Low-Resource Language Pipelines and Seed-ASR Protocols
The Seed-ASR framing extends beyond LLM-based models to seed-based protocols for rapid ASR bootstrapping in low-resource and domain-specific environments.
- Audiobook Alignment Pipeline (Yeroyan et al., 2024): Converts hours-long, single-transcript audiobooks into 4–15 s ASR-ready segments using Neural Forced Alignment (Conformer-CTC/Viterbi) or a VAD-ASR-CER (VAC) pipeline, underpinning rapid corpus creation for under-resourced languages.
  - Segment splitting, silence trimming, crossfade smoothing, and WER-based filtering (see the sketch after this list) ensure high-quality ASR training pairs.
  - Case study: Armenian corpus segmentation expanded the corpus from 3.7 h to 20.93 h; WER on the audiobook test set improved from 0.39 (baseline) to 0.16 (MCV + audiobooks).
- Domain Adaptation via Seed Lexicon Expansion (Gretter et al., 2021): Data selection and LM adaptation for hybrid ASR systems targeting specialized terminology. Morphological and semantic expansion pipelines select sentences maximizing seed word/term coverage, yielding substantial OOV rate and WER reduction in in-domain tasks.
- Transfer Learning and Cross-lingual Adaptation (Inaguma et al., 2018; Geng et al., 2025): Seed models (multilingual S2S, pre-trained speech foundation models) are fine-tuned on small amounts of real/synthetic speech data for new languages (e.g., SENĆOTEN, BABEL languages). LM fusion (shallow/deep/cold) is critical.
  - TTS augmentation (FastSpeech2+HiFiGAN pipelines) and n-gram LM fusion drive word/character error rates down, even under high OOV rates.
- Iterative Self-Supervised Alignment and Domain Adaptation (López et al., 2022): Seed ASR models with acoustic CTC loss are used to produce pseudo-alignments in new domains, filtered by unsupervised confidence scores, enabling domain adaptation without human annotation.
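A recurring primitive in these pipelines is error-rate-based filtering of candidate (audio, text) pairs produced by an upstream aligner or VAD splitter. The sketch below uses the `jiwer` package; the `segments` iterable and `transcribe` callable stand in for the pipeline-specific alignment and seed-ASR components:

```python
from jiwer import wer

def filter_segments(segments, transcribe, max_wer: float = 0.3):
    """Keep segments whose seed-ASR hypothesis agrees with the aligned
    reference text within a WER threshold.

    segments:   iterable of (audio, reference_text) pairs from forced
                alignment or VAD-based splitting (assumed upstream step)
    transcribe: callable mapping audio -> hypothesis text (seed ASR model)
    """
    kept = []
    for audio, ref in segments:
        if wer(ref, transcribe(audio)) <= max_wer:
            kept.append((audio, ref))
    return kept
```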
6. Deployment, Limitations, and Future Work
- Deployment: Seed-ASR models require no external LM at inference, enabling out-of-the-box deployment in context-rich and domain-adaptive scenarios. For scenario adaptation, one epoch of context SFT on small, tailored datasets suffices (Bai et al., 2024). Prompt engineering and beam-search parameters are adjustable for on-site customization (see the prompt-construction sketch after this list).
- Limitations: The high model capacity (∼2B-parameter Conformer encoder, >10B-parameter LLM) increases runtime latency and memory footprint relative to lightweight end-to-end systems. Multilingual coverage is still limited (currently ~9 languages for the ML model) and expanding. Ultra-long context capacity is limited (~5 minutes); improvements via sparse attention are in progress.
- Future Directions: Expanding language coverage, unifying ASR with translation/speaker ID within AcLLM, refining RL rewards for semantic tasks, and extending unsupervised bootstrapping are active research areas (Bai et al., 2024, Yeroyan et al., 2024, Inaguma et al., 2018).
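Since all contextualization enters as text, scenario customization at inference reduces to assembling the textual prefix. A minimal sketch follows; the instruction string matches Section 1, while the context field names and format are illustrative assumptions:

```python
def build_prompt(context_items=None,
                 instruction="Transcribe the speech into text:"):
    """Assemble the textual prefix prepended to the audio embeddings.
    context_items, e.g. {"domain": "medical", "keywords": "metoprolol"},
    is flattened into plain text tokens (format is an assumption)."""
    parts = [f"{k}: {v}" for k, v in (context_items or {}).items()]
    parts.append(instruction)
    return "\n".join(parts)

# Example: bias recognition toward domain keywords at inference time.
print(build_prompt({"domain": "medical dictation", "keywords": "metoprolol"}))
```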
7. Summary Table: Core Features (Seed-ASR LLM, Data Protocols)
| Aspect | Seed-ASR (LLM) (Bai et al., 2024) | Data Protocols / Pipelines |
|---|---|---|
| Input Modality | [Prompt \| Context \| Audio] | Audio + text corpora |
| Audio Encoder | 32-layer Conformer, SSL | Conformer/CTC, baseline ASR, TTS |
| Text Generation | Autoregressive LLM (10B+) | - |
| Contextualization Method | Text token prepending | n-gram LM, TTS/LM fusion |
| Adaptation Mode | SFT, Context SFT, RL | Forced alignment, VAC, word2vec-lex |
| LM/External Info at Inference | None required | Shallow/Cold fusion, data selection |
| Deployment | Domain prompt/fine-tune | Protocolized scripts/tools |
Seed-ASR, as both model and methodology, represents the emergence of context-aware, prompt-driven, and rapidly adaptable speech recognition frameworks built atop “seed” models—whether for high-resource, context-dense settings or under-resourced languages and domains (Bai et al., 2024, Yeroyan et al., 2024, Gretter et al., 2021, Inaguma et al., 2018, López et al., 2022).