Paraformer-zh: Efficient Mandarin ASR

Updated 23 June 2026
  • Paraformer-zh is a non‐autoregressive ASR framework for Mandarin that integrates CIF predictors and GLM samplers to enable single-pass, parallel decoding.
  • It achieves state-of-the-art performance with up to 10× faster inference and competitive CER on large-scale industrial benchmarks.
  • The architecture supports contextual biasing and speaker attribution, enhancing hotword recognition and multi-speaker accuracy in real-world applications.

Paraformer-zh is an advanced non-autoregressive (NAR) end-to-end automatic speech recognition (ASR) framework for Mandarin Chinese, designed to achieve high recognition accuracy while providing substantial inference speedup over autoregressive (AR) transformer baselines. Developed and deployed at industrial scale, Paraformer-zh incorporates a continuous integrate-and-fire (CIF) predictor, a glancing language model (GLM) sampler, and a parallel transformer decoder, enabling single-pass parallel decoding. Its architecture, loss formulation, and empirical results demonstrate the feasibility of closing the NAR/AR performance gap on large-vocabulary Mandarin ASR, including robust support for contextual and speaker-attributed extensions (Gao et al., 2022, Shi et al., 2023, Li et al., 2023).

1. Architectural Foundation and Model Components

Paraformer-zh is constructed from four principal modules: an input encoder, a CIF-based pointer/length predictor, a GLM-based sampler, and a parallel NAR decoder. The encoder is a deep stack of self-attentive SAN-M layers (50 layers in industrial deployments) that maps 80-dimensional frame-wise filterbank features to latent hidden states. The CIF predictor computes per-frame weights $\alpha_t$, transducing frame-level encodings into a variable-length sequence of acoustic embeddings $E_{1:L'}$ aligned with target token boundaries. This mechanism predicts the number of output tokens and produces the requisite hidden variables for parallel decoding (Gao et al., 2022).

During training, the GLM sampler introduces partial ground-truth supervision by mixing ground-truth token embeddings with CIF output embeddings at positions sampled according to the Hamming distance between a first-pass prediction and the ground truth. The NAR decoder, a multi-layer transformer (typically 16 layers with hidden size 2048), then consumes these embeddings in parallel to output the target character sequence. Unlike AR decoders, the entire sequence is produced in a single computational pass, so inference runs more than $10\times$ faster, as measured by real-time factor (RTF), than AR baselines of comparable accuracy (Shi et al., 2023).
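The mixing step can be illustrated with a minimal sketch. It assumes that the number of substituted positions is the sampling ratio times the Hamming distance, rounded up, and that the positions themselves are drawn uniformly at random; the names `glancing_mix` and `sampling_ratio` are hypothetical and not taken from the source.

```python
import math
import random


def hamming_distance(predicted, reference):
    """Count positions where the first-pass prediction differs from the reference."""
    return sum(p != r for p, r in zip(predicted, reference))


def glancing_mix(acoustic_embs, target_embs, predicted, reference,
                 sampling_ratio=0.5, rng=None):
    """Replace some CIF acoustic embeddings with ground-truth token embeddings.

    Assumption: the number of replaced positions grows with the Hamming
    distance, so a poor first pass receives more ground-truth hints.
    """
    rng = rng or random.Random(0)
    n = len(reference)
    num_hints = math.ceil(sampling_ratio * hamming_distance(predicted, reference))
    hint_positions = set(rng.sample(range(n), min(num_hints, n)))
    mixed = [target_embs[i] if i in hint_positions else acoustic_embs[i]
             for i in range(n)]
    return mixed, hint_positions
```

When the first pass is already correct, the Hamming distance is zero and the decoder sees only acoustic embeddings, which matches the inference-time condition.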

2. CIF Predictor and Length Modeling

Central to Paraformer-zh’s NAR capability is the CIF predictor, which accumulates the weights $\alpha_t$ both to estimate the output sequence length and to trigger a boundary emission for each output token. The sequence of acoustic embeddings is generated by “firing” each time the accumulated weight reaches a calculated threshold $\beta = (\sum_{t=1}^{T} \alpha_t) / \lceil \sum_{t=1}^{T} \alpha_t \rceil$. The number of output tokens $N'$ is thus aligned to the true target length $N$, enforced by an MAE-based length prediction loss,

$$\mathcal{L}_{\text{MAE}} = \left| N - \sum_{t=1}^{T} \alpha_t \right|,$$

combined with standard cross-entropy and, in later stages, minimum word error rate (MWER) loss via negative sampling (Gao et al., 2022).
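The firing rule and the length loss can be sketched in plain Python. This is an illustrative simplification that assumes frame vectors are lists of floats and that a frame whose weight crosses the threshold is split between two adjacent tokens; the function names are hypothetical.

```python
import math


def cif_fire(hidden, alphas):
    """Integrate per-frame weights and fire one embedding per predicted token.

    hidden: list of frame vectors (lists of floats); alphas: per-frame weights.
    Returns (embeddings, beta), where beta is the dynamic threshold
    sum(alphas) / ceil(sum(alphas)).
    """
    total = sum(alphas)
    num_tokens = math.ceil(total - 1e-9)  # guard against float round-off
    if num_tokens == 0:
        return [], 0.0
    beta = total / num_tokens
    dim = len(hidden[0])
    embeddings = []
    acc_weight = 0.0
    acc_vec = [0.0] * dim
    for h, a in zip(hidden, alphas):
        remaining = a
        # A frame may complete a token; leftover weight starts the next one.
        while (acc_weight + remaining >= beta - 1e-9
               and len(embeddings) < num_tokens):
            used = beta - acc_weight
            embeddings.append([v + used * x for v, x in zip(acc_vec, h)])
            remaining -= used
            acc_weight, acc_vec = 0.0, [0.0] * dim
        acc_weight += remaining
        acc_vec = [v + remaining * x for v, x in zip(acc_vec, h)]
    return embeddings, beta


def mae_length_loss(num_target_tokens, alphas):
    """L_MAE = |N - sum_t alpha_t|."""
    return abs(num_target_tokens - sum(alphas))
```

Because the threshold is the total weight divided by its ceiling, the accumulated weight is consumed exactly, and the number of fired embeddings equals the rounded-up sum of the weights.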

3. Contextual Biasing and Hotword Customization

Paraformer-zh serves as the backbone for the SeACo-Paraformer system, which extends the architecture with an explicit semantic biasing module for hotword customization (Shi et al., 2023). User-supplied hotwords are encoded using a 2-layer LSTM to generate $Z_{1:n}$ and further processed by a 4-layer transformer bias decoder employing dual cross-attention streams (from both decoder and CIF outputs to the hotword encodings). The resulting outputs are fused and projected through a “BiasOutLayer” to yield token probabilities $P_b$ over an extended symbol set (vocabulary plus a “no-bias” token).

Inference merges predictions from the Paraformer backbone and the bias module step-wise, favoring the bias output $P_b$ at positions where a hotword is activated and otherwise defaulting to the backbone output. Large hotword lists are handled efficiently with Attention Score Filtering (ASF): the top-$k$ hotwords are selected based on attention activations, maintaining high recall and CER improvements as the list size $n$ grows into the thousands (Shi et al., 2023).
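A simplified sketch of the two inference-time steps follows. It assumes a hard switch between the two predictions and scores each hotword by its attention mass summed over decoding steps; the actual system may combine probabilities differently. The `NO_BIAS` symbol name and both function names are hypothetical.

```python
NO_BIAS = "<no-bias>"  # hypothetical name for the extra "no-bias" symbol


def filter_hotwords(hotwords, attention, k):
    """Attention Score Filtering sketch: keep the k hotwords with the largest
    total attention mass. attention[step][j] is the weight on hotword j."""
    totals = [sum(step[j] for step in attention) for j in range(len(hotwords))]
    ranked = sorted(range(len(hotwords)), key=lambda j: totals[j], reverse=True)
    return [hotwords[j] for j in sorted(ranked[:k])]  # keep original order


def merge_predictions(asr_probs, bias_probs):
    """Step-wise merge: take the bias token where a hotword is activated,
    otherwise fall back to the backbone token.

    Each element of asr_probs / bias_probs is a dict mapping token -> probability.
    """
    merged = []
    for p_asr, p_bias in zip(asr_probs, bias_probs):
        bias_token = max(p_bias, key=p_bias.get)
        asr_token = max(p_asr, key=p_asr.get)
        merged.append(asr_token if bias_token == NO_BIAS else bias_token)
    return merged
```

Filtering first keeps the bias decoder's attention over a short list, which is what makes lists with thousands of entries tractable.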

4. Mathematical Formulation and Training Losses

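Given input speech features and reference tokens, computations proceed as:

  • The encoder maps the frame-level speech features to hidden states.
  • The CIF predictor converts the hidden states into acoustic embeddings $E_{1:L'}$.
  • The parallel decoder consumes the acoustic embeddings to produce decoder states.
  • An output layer maps the decoder states to token probabilities.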

Hotwords are embedded into the encodings $Z_{1:n}$, and multi-head cross-attention from the decoder and CIF outputs to these encodings generates the bias representations. The merged output is projected to the bias probabilities $P_b$, which are combined with the backbone probabilities to yield the final token probabilities. The ASR loss is computed as standard cross-entropy, while the biasing loss is used during bias module finetuning, with the ASR backbone frozen.

Negative-sample MWER loss and GLM-based masking further boost accuracy. Notably, Paraformer-zh jointly optimizes the cross-entropy, MAE, and MWER objectives. The GLM sampling factor is chosen empirically to balance ground-truth substitution against convergence.

5. Empirical Results and Performance Benchmarks

Paraformer-zh attains state-of-the-art CER on public and industrial Mandarin benchmarks. On AISHELL-1, Paraformer achieves Dev/Test CERs of 4.6/5.2%, matching AR transformer accuracy but with a 12× RTF gain. On a 20,000-hour industrial Mandarin task, Paraformer-zh yields a CER of 14.07% (far-field) and 7.86% (common) with the MWER stage, maintaining a <2% relative gap to AR transformers. In the SeACo-Paraformer context, general ASR CER is ~3.2% across variants; hotword recall rises to ≈65% (ASF-augmented), with 30.2% relative CER reductions compared to prior contextual baselines as the hotword list scales from 231 to 4000 entries. Decoding speed with Paraformer-zh remains >10× over beam-searched AR (Shi et al., 2023, Gao et al., 2022).
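For readers unfamiliar with the metrics quoted above, the following sketch shows how a character error rate and a relative reduction are conventionally computed. It uses standard Levenshtein distance over characters; the function names are illustrative, and the example strings are not drawn from any benchmark.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Relative improvement of `improved` over `baseline`, as a fraction."""
    return (baseline - improved) / baseline
```

A relative figure such as a 30.2% CER reduction is computed against the baseline's own error rate, so it is not the same as a drop of 30.2 absolute percentage points.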

6. Mandarin-Specific Implementation and Tokenization

All Paraformer-zh variants use Mandarin character-level modeling, with a symbol inventory of 4,500–5,000 characters. Neither subword modeling nor byte-pair encoding is used, and the same character embeddings are shared between the ASR and contextual bias modules. Preprocessing uses 80-dimensional log-Mel features, mean-variance normalization, and on-the-fly SpecAugment. Industrial training uses a 50-layer SAN-M encoder and a 16-layer transformer decoder (hidden size 2048), with batch sizes of up to 6,000 frames/GPU over 16 GPUs (Shi et al., 2023).
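Character-level tokenization and mean-variance normalization can be sketched as follows. The special symbols `<blank>` and `<unk>` and the per-utterance normalization are assumptions made for illustration; production systems often normalize with global statistics instead.

```python
import math


def build_vocab(transcripts, specials=("<blank>", "<unk>")):
    """Map every distinct character to an integer id (special symbols first)."""
    chars = sorted({ch for text in transcripts for ch in text})
    return {sym: idx for idx, sym in enumerate(list(specials) + chars)}


def encode(text, vocab):
    """One token per Mandarin character; unseen characters map to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(ch, unk) for ch in text]


def mean_variance_normalize(frames, eps=1e-8):
    """Normalize each feature dimension to zero mean and unit variance
    over the frames of one utterance (an assumption of this sketch)."""
    num_frames, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
            for d in range(dim)]
    return [[(f[d] - means[d]) / (stds[d] + eps) for d in range(dim)]
            for f in frames]
```

Because each character is its own token, the output length of the recognizer equals the number of characters in the transcript, which is the quantity the CIF predictor estimates.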

7. Extensions: Speaker Attribution and Multi-Speaker ASR

The Paraformer-zh architecture is further adapted for speaker-attributed ASR (SA-Paraformer), with a specialized speaker encoder and an auxiliary speaker decoder using cosine-similarity attention against an utterance's speaker inventory. The model outperforms cascaded speaker-attributed and diarization systems by reducing speaker-dependent CER by up to 6.1% relative (on AliMeeting), while reducing RTF by a factor of 10 compared to joint AR models. Techniques such as filling/inactive-speaker injection and interCTC auxiliary losses further stabilize performance in highly variable multi-speaker scenarios (Li et al., 2023).
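The cosine-similarity attention over a speaker inventory can be illustrated with a small sketch. It assumes a softmax over scaled cosine similarities between a speaker query vector and each enrolled profile; the `temperature` parameter and the function names are hypothetical and not taken from the source.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def speaker_attention(query, inventory, temperature=0.1):
    """Softmax over cosine similarities between a speaker query and each
    profile in the utterance's speaker inventory."""
    sims = [cosine(query, profile) / temperature for profile in inventory]
    peak = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]
```

The highest-weighted profile gives the speaker label for the corresponding token, and inactive profiles in the inventory simply receive low weight.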


Paraformer-zh demonstrates that highly optimized non-autoregressive architectures, when equipped with targeted improvements such as CIF, GLM-style sampling, and explicit biasing modules, yield state-of-the-art accuracy, recall, and efficiency for large-scale industrial Mandarin ASR in both standard and contextualized use cases (Gao et al., 2022, Shi et al., 2023, Li et al., 2023).