Paraformer-zh: Efficient Mandarin ASR
- Paraformer-zh is a non-autoregressive ASR framework for Mandarin that integrates CIF predictors and GLM samplers to enable single-pass, parallel decoding.
- It delivers more than 10× faster inference than autoregressive baselines while keeping CER competitive on large-scale industrial benchmarks.
- The architecture supports contextual biasing and speaker attribution, enhancing hotword recognition and multi-speaker accuracy in real-world applications.
Paraformer-zh is a non-autoregressive (NAR) end-to-end automatic speech recognition (ASR) framework for Mandarin Chinese, designed to achieve high recognition accuracy while providing a substantial inference speedup over autoregressive (AR) transformer baselines. Developed and deployed at industrial scale, Paraformer-zh combines a continuous integrate-and-fire (CIF) predictor, a glancing language model (GLM) sampler, and a parallel transformer decoder, enabling single-pass parallel decoding. Its architecture, loss formulation, and empirical results demonstrate the feasibility of closing the NAR/AR performance gap on large-vocabulary Mandarin ASR, including support for contextual and speaker-attributed extensions (Gao et al., 2022, Shi et al., 2023, Li et al., 2023).
1. Architectural Foundation and Model Components
Paraformer-zh is constructed from four principal modules: an input encoder, a CIF-based pointer/length predictor, a GLM-based sampler, and a parallel NAR decoder. The encoder is a deep stack of self-attentive SAN-M layers (50 layers in industrial deployments), mapping 80-dimensional frame-wise filterbank features to latent hidden states. The CIF predictor computes per-frame weights $\alpha_t$, transducing frame-level encodings into a variable-length sequence of acoustic embeddings aligned with target token boundaries. This mechanism predicts the number of output tokens and produces the hidden variables required for parallel decoding (Gao et al., 2022).
During training, the GLM sampler introduces partial ground-truth supervision by mixing ground-truth token embeddings with CIF output embeddings, at a number of positions that scales with the Hamming distance between a first-pass prediction and the ground truth. The NAR decoder, a multi-layer transformer (typically 16 layers with hidden size 2048), then consumes these embeddings in parallel to output the target character sequence. Unlike AR decoders, the entire sequence is produced in a single computational pass, yielding a substantially lower inference real-time factor (RTF) than AR baselines of comparable accuracy (Shi et al., 2023).
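The glancing step described above can be sketched in a few lines. This is a minimal illustration, not the published implementation: the function names, the sampling count `ceil(lam * distance)`, the uniform choice of positions, and the value `lam=0.5` are all assumptions made for the example. It also assumes the first-pass prediction and the reference have the same length.

```python
import math
import random

def hamming_distance(pred, target):
    """Count positions where the first-pass prediction differs from the reference."""
    return sum(p != t for p, t in zip(pred, target))

def glancing_mix(acoustic_emb, target_emb, pred, target, lam=0.5, rng=None):
    """Swap some CIF acoustic embeddings for ground-truth token embeddings.

    The number of swapped positions grows with the Hamming distance between
    the first-pass prediction and the reference (assumed: ceil(lam * distance)).
    Positions are drawn uniformly at random (also an assumption).
    """
    rng = rng or random.Random(0)
    n_sample = math.ceil(lam * hamming_distance(pred, target))
    positions = set(rng.sample(range(len(target)), n_sample))
    mixed = [target_emb[i] if i in positions else acoustic_emb[i]
             for i in range(len(target))]
    return mixed, positions
```

When the first pass is already correct, no ground-truth embeddings are injected, so the amount of supervision shrinks as the model improves.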
2. CIF Predictor and Length Modeling
Central to Paraformer-zh’s NAR capability is the CIF predictor, which accumulates the per-frame weights $\alpha_t$ both to estimate the output sequence length and to trigger a boundary emission for each output token. The sequence of acoustic embeddings is generated by “firing” each time the accumulated weight surpasses a calculated threshold $\beta$. The predicted number of output tokens, $\sum_{t=1}^{T} \alpha_t$, is thus aligned to the true target length $N$, enforced by an MAE-based length prediction loss,
$$\mathcal{L}_{\text{MAE}} = \left| N - \sum_{t=1}^{T} \alpha_t \right|,$$
combined with standard cross-entropy and, in later stages, minimum word error rate (MWER) loss via negative sampling (Gao et al., 2022).
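The integrate-and-fire mechanism and the length loss can be illustrated with a small pure-Python sketch. It rests on simplifying assumptions that are not taken from the source: a fixed threshold `beta=1.0`, every weight strictly below `beta` (so a frame fires at most once), and any leftover weight at the end of the utterance being discarded. The function names are hypothetical.

```python
def cif(encodings, alphas, beta=1.0):
    """Accumulate per-frame weights; emit one embedding each time the sum reaches beta."""
    dim = len(encodings[0])
    fired = []
    acc = 0.0                 # weight accumulated for the current token
    state = [0.0] * dim       # weighted sum of frames for the current token
    for h, a in zip(encodings, alphas):
        if acc + a < beta:
            # Not enough weight yet: keep integrating this frame.
            acc += a
            state = [s + a * x for s, x in zip(state, h)]
        else:
            # Threshold reached: use just enough of this frame to complete the token.
            used = beta - acc
            fired.append([s + used * x for s, x in zip(state, h)])
            # The remainder of the frame's weight starts the next token.
            rest = a - used
            acc = rest
            state = [rest * x for x in h]
    return fired              # any unfinished tail is dropped in this sketch

def length_mae(alphas, target_len):
    """MAE length loss: |N - sum of weights|."""
    return abs(target_len - sum(alphas))

frames = [[1.0], [1.0], [2.0], [2.0]]
weights = [0.5, 0.5, 0.5, 0.5]
tokens = cif(frames, weights)
```

With these four frames and weights of 0.5 each, two embeddings fire, and the length loss against a two-token reference is zero.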
3. Contextual Biasing and Hotword Customization
Paraformer-zh serves as the backbone for the SeACo-Paraformer system, which extends the architecture with an explicit semantic biasing module for hotword customization (Shi et al., 2023). User-supplied hotwords are encoded using a 2-layer LSTM to generate hotword embeddings $Z$, which are further processed by a 4-layer transformer bias decoder employing dual cross-attention streams (from both the decoder outputs and the CIF outputs to the hotword encodings). The resulting outputs are fused and projected through a “BiasOutLayer” to yield token probabilities over an extended symbol set (vocabulary plus a “no-bias” token).
Inference merges predictions from the Paraformer backbone ($P_{\text{ASR}}$) and the bias module ($P_b$) step-wise, favoring the bias output $P_b$ at positions where a hotword is activated and otherwise defaulting to $P_{\text{ASR}}$. Large hotword lists are handled efficiently with Attention Score Filtering (ASF): the top-$k$ hotwords are selected based on attention activations, maintaining high recall and CER improvements as the list size $n$ grows into the thousands (Shi et al., 2023).
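A simplified reading of the two inference-time steps, hotword filtering and step-wise merging, is sketched below. It works on already-decoded tokens rather than probability distributions, and the `<no-bias>` string, the function names, and the per-hotword score list are illustrative assumptions rather than the system's actual interface.

```python
NO_BIAS = "<no-bias>"  # hypothetical spelling of the extra "no-bias" symbol

def asf_top_k(attention_scores, hotwords, k):
    """Keep the k hotwords with the highest attention score, preserving list order."""
    ranked = sorted(range(len(hotwords)),
                    key=lambda i: attention_scores[i], reverse=True)
    return [hotwords[i] for i in sorted(ranked[:k])]

def merge_predictions(asr_tokens, bias_tokens):
    """Take the bias module's token wherever it predicts something other than no-bias."""
    return [a if b == NO_BIAS else b for a, b in zip(asr_tokens, bias_tokens)]

kept = asf_top_k([0.1, 0.7, 0.05, 0.4], ["阿里", "达摩院", "通义", "语音"], k=2)
merged = merge_predictions(["打", "摸", "院"], ["达", "摩", NO_BIAS])
```

Filtering first keeps the bias decoder's attention focused on a short candidate list, which is why recall can be maintained even when the full hotword list is very large.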
4. Mathematical Formulation and Training Losses
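Given speech $X$ and reference tokens $Y$, computations proceed as:
- $E = \text{Encoder}(X)$
- $E_a = \text{CIF}(E)$
- $D = \text{Decoder}(E, E_a)$
- $P_{\text{ASR}} = \text{Softmax}(\text{OutputLayer}(D))$
Hotwords $H$ are embedded ($Z = \text{LSTM}(\text{Emb}(H))$), and multi-head attention generates the two cross-attention streams $D_b = \text{MHA}(D, Z, Z)$ and $E_b = \text{MHA}(E_a, Z, Z)$. The merged output $P_b$ (the fused streams projected through the BiasOutLayer) is combined with $P_{\text{ASR}}$ to yield final token probabilities $P$. The ASR loss is computed as standard cross-entropy, while the biasing loss,
$$\mathcal{L}_{\text{bias}} = -\sum_{l=1}^{N} \log P_b\left(y^{b}_{l} \mid X, H\right),$$
where $y^{b}_{l}$ is the bias target at position $l$ (the “no-bias” token outside hotword spans), is used during bias module finetuning, with the ASR backbone frozen.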
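The bias-module fine-tuning objective can be pictured with the following sketch. It assumes, as an illustration only, that bias targets keep the reference token inside hotword spans and use a no-bias label everywhere else, and that the loss is a summed negative log-likelihood. The names `bias_targets` and `cross_entropy`, and the dictionary representation of probabilities, are hypothetical.

```python
import math

NO_BIAS = "<no-bias>"  # hypothetical spelling of the extra "no-bias" symbol

def bias_targets(reference, hotword_spans):
    """Keep reference tokens inside hotword spans; label every other position no-bias."""
    covered = set()
    for start, end in hotword_spans:       # half-open [start, end) token spans
        covered.update(range(start, end))
    return [tok if i in covered else NO_BIAS
            for i, tok in enumerate(reference)]

def cross_entropy(prob_rows, targets):
    """Summed negative log-likelihood; each row maps token -> probability."""
    return -sum(math.log(row[t]) for row, t in zip(prob_rows, targets))

targets = bias_targets(list("abcde"), [(1, 3)])
loss = cross_entropy([{"a": 0.5}, {"b": 0.5}], ["a", "b"])
```

Because the backbone is frozen during this stage, only the bias module's parameters would receive gradients from such a loss.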
Negative-sampling MWER loss and GLM-based glancing sampling further boost accuracy. Notably, Paraformer-zh jointly optimizes the cross-entropy, MAE, and MWER objectives. The sampling factor $\lambda$ is tuned empirically to balance substitution errors against convergence.
5. Empirical Results and Performance Benchmarks
Paraformer-zh attains state-of-the-art CER on public and industrial Mandarin benchmarks. On AISHELL-1, Paraformer achieves Dev/Test CERs of 4.6/5.2%, matching AR transformer accuracy with a 12× RTF gain. On a 20,000-hour industrial Mandarin task, Paraformer-zh yields a CER of 14.07% (far-field) and 7.86% (common) with the MWER stage, staying within a 2% relative gap of AR transformers. In the SeACo-Paraformer context, general ASR CER is ~3.2% across variants; hotword recall rises to ≈65% (ASF-augmented), with 30.2% relative CER reductions compared to prior contextual baselines as the hotword list scales from 231 to 4000 entries. Decoding with Paraformer-zh remains more than 10× faster than AR decoding with beam search (Shi et al., 2023, Gao et al., 2022).
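For readers unfamiliar with the two metrics quoted above, the sketch below shows how they are conventionally computed: CER as character-level edit distance divided by reference length, and RTF as processing time divided by audio duration. The function names are illustrative, and the example strings are made up rather than drawn from any benchmark.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edits needed per reference character."""
    return edit_distance(ref, hyp) / len(ref)

def rtf(processing_seconds, audio_seconds):
    """Real-time factor: lower means faster than real time."""
    return processing_seconds / audio_seconds

example_cer = cer("今天天气很好", "今天天汽很好")  # one substitution in six characters
```

Under this definition, a "10× RTF gain" means the same audio is processed in one tenth of the time.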
6. Mandarin-Specific Implementation and Tokenization
All Paraformer-zh variants use Mandarin character-level modeling, with a symbol inventory of 4,500–5,000 characters. No subword modeling or byte-pair encoding is used; the same character embeddings are shared between the ASR and contextual bias modules. Preprocessing uses 80-dimensional log-Mel features, mean-variance normalization, and on-the-fly SpecAugment. Industrial training uses a 50-layer SAN-M encoder and a 16-layer transformer decoder (hidden size 2048), with batch sizes of up to 6,000 frames per GPU over 16 GPUs (Shi et al., 2023).
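Character-level tokenization and mean-variance normalization are simple enough to sketch directly. The version below is illustrative only: the special tokens `<blank>` and `<unk>`, the function names, and the choice of per-utterance (rather than global) statistics are assumptions not stated in the source.

```python
import math

def build_vocab(transcripts, specials=("<blank>", "<unk>")):
    """Map each distinct character to an integer id, after the special tokens."""
    chars = sorted({ch for text in transcripts for ch in text})
    return {tok: i for i, tok in enumerate(list(specials) + chars)}

def encode(text, vocab):
    """One id per character; unseen characters fall back to <unk>."""
    return [vocab.get(ch, vocab["<unk>"]) for ch in text]

def cmvn(frames, eps=1e-8):
    """Normalize each feature dimension to zero mean and unit variance."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    return [[(f[d] - mean[d]) / math.sqrt(var[d] + eps) for d in range(dim)]
            for f in frames]

vocab = build_vocab(["你好", "好的"])
ids = encode("你好吗", vocab)
normalized = cmvn([[1.0, 10.0], [3.0, 10.0]])
```

Since every character is its own token, the output length of the recognizer equals the number of characters in the transcript, which is the quantity the CIF predictor estimates.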
7. Extensions: Speaker Attribution and Multi-Speaker ASR
The Paraformer-zh architecture is further adapted for speaker-attributed ASR (SA-Paraformer), with a specialized speaker encoder and an auxiliary speaker decoder using cosine-similarity attention against an utterance's speaker inventory. The model outperforms cascaded speaker-attributed and diarization systems by reducing speaker-dependent CER by up to 6.1% relative (on AliMeeting), while reducing RTF by a factor of 10 compared to joint AR models. Techniques such as filling/inactive-speaker injection and interCTC auxiliary losses further stabilize performance in highly variable multi-speaker scenarios (Li et al., 2023).
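The idea of attending over a speaker inventory by cosine similarity can be sketched as follows. This is a toy illustration under stated assumptions: the softmax over cosine scores, the `temperature` parameter, and the function names are choices made for the example, and all vectors are assumed to be non-zero.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def speaker_attention(query, inventory, temperature=1.0):
    """Softmax over cosine similarities between a query and each enrolled speaker."""
    scores = [cosine(query, spk) / temperature for spk in inventory]
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = speaker_attention([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0]])
best_speaker = max(range(len(weights)), key=weights.__getitem__)
```

Cosine similarity ignores embedding magnitude, so the assignment depends only on the direction of the speaker representation.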
Paraformer-zh demonstrates that highly optimized non-autoregressive architectures, when equipped with targeted improvements such as CIF, GLM-style sampling, and explicit biasing modules, yield state-of-the-art accuracy, recall, and efficiency for large-scale industrial Mandarin ASR in both standard and contextualized use cases (Gao et al., 2022, Shi et al., 2023, Li et al., 2023).