UniEnc-CASSNAT: Efficient Non-Autoregressive ASR
- The paper demonstrates that a single encoder, reused in two forward passes, effectively models both acoustic and token-level contexts with roughly a quarter fewer parameters than CASS-NAT (99.3M vs. 130.5M).
- It integrates CTC and CASS-NAT paradigms with iterative error-based sampled alignment to enhance non-autoregressive decoding performance.
- Empirical results show that UniEnc-CASSNAT achieves competitive word error rates across benchmarks while delivering 3–5× faster inference than autoregressive models.
UniEnc-CASSNAT is an encoder-only non-autoregressive automatic speech recognition (ASR) framework designed to efficiently leverage speech self-supervised learning (SSL) models as foundational components. It integrates and extends the connectionist temporal classification (CTC) and CTC alignment-based single-step non-autoregressive transformer (CASS-NAT) paradigms, enabling efficient, dependency-aware sequence modeling with reduced parameter count and fast inference (Fan et al., 2024).
1. Architectural Overview
UniEnc-CASSNAT comprises three principal modules:
- A front-end convolutional encoder mapping raw acoustic signals to a frame-level hidden sequence $X = (x_1, \dots, x_T)$, as in HuBERT’s configuration.
- A contextual encoder instantiating a stack of Transformer layers (e.g., 12, as in HuBERT-base), optionally initialized from a speech SSL model such as HuBERT.
- A token-level acoustic extractor ("TAE-E") employing a compact self-attention mechanism to pool token-level embeddings based on frame-token alignments.
UniEnc-CASSNAT discards the classic sequence decoder. Instead, the contextual encoder is reused in two consecutive forward passes:
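- The first pass processes acoustic features alone: $H^{(1)} = \mathrm{Enc}(X)$, where $X$ is the frame-level sequence produced by the convolutional front-end.
- Token-level embeddings $E$ are extracted from $H^{(1)}$ using TAE-E, then concatenated back to the frame sequence: $\tilde{X} = [X; E]$.
- The second pass computes $H^{(2)} = \mathrm{Enc}(\tilde{X})$. The final $U$ vectors (one per token) correspond to token-level “decoder” outputs. The initial $T$ vectors (one per frame) provide optional auxiliary CTC supervision.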
This design enables a single encoder module to encode both acoustic and token contexts, reducing parameter count and facilitating transfer from SSL pretraining.
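The two-pass data flow can be sketched in a few lines of plain Python. This is only an illustration of the bookkeeping (same encoder called twice, second-pass output split into a frame part and a token part). `toy_encoder` and `pool_halves` are hypothetical stand-ins invented for this sketch, not the Transformer encoder or the TAE-E module of the paper.

```python
from typing import Callable, List

Vector = List[float]
Sequence = List[Vector]


def toy_encoder(seq: Sequence) -> Sequence:
    """Hypothetical stand-in for the shared contextual encoder.

    Each output is the input vector plus the sequence mean, so every position
    depends on every other position (a crude proxy for self-attention).
    """
    dim = len(seq[0])
    mean = [sum(v[i] for v in seq) / len(seq) for i in range(dim)]
    return [[v[i] + mean[i] for i in range(dim)] for v in seq]


def pool_halves(hidden: Sequence) -> Sequence:
    """Hypothetical TAE extractor: average each half of the frames."""
    mid = len(hidden) // 2
    dim = len(hidden[0])
    return [[sum(v[i] for v in part) / len(part) for i in range(dim)]
            for part in (hidden[:mid], hidden[mid:])]


def two_pass(frames: Sequence,
             extract_tae: Callable[[Sequence], Sequence],
             encoder: Callable[[Sequence], Sequence] = toy_encoder):
    """Run the same encoder twice and split the second-pass output."""
    num_frames = len(frames)
    first = encoder(frames)            # pass 1: acoustic frames only
    tae = extract_tae(first)           # token-level acoustic embeddings
    second = encoder(frames + tae)     # pass 2: frames followed by tokens
    frame_out = second[:num_frames]    # would feed the auxiliary CTC head
    token_out = second[num_frames:]    # plays the role of decoder output
    return first, frame_out, token_out


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
first, frame_out, token_out = two_pass(frames, pool_halves)
print(len(first), len(frame_out), len(token_out))  # prints: 4 4 2
```

Because the token embeddings are appended to the frame sequence before the second call, no separate decoder parameters are needed; the split by position recovers the two kinds of output.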
2. Mathematical Formalism
Key notation:
- $X = (x_1, \dots, x_T)$: acoustic feature sequence after convolutional encoding ($T$ frames).
- $Y = (y_1, \dots, y_U)$: ground-truth output token sequence.
- $W_{\mathrm{ctc}}, W_{\mathrm{dec}}$: output-classification matrices for CTC and "decoder" projections.
First Pass (Encoder-side)
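- $X = \mathrm{ConvEnc}(s)$, where $s$ is the raw waveform
- $H^{(1)} = \mathrm{Enc}(X) = (h^{(1)}_1, \dots, h^{(1)}_T)$
- Per-frame CTC distribution: $P^{(1)}_{\mathrm{ctc}}(z_t \mid X) = \mathrm{softmax}(W_{\mathrm{ctc}} h^{(1)}_t)$
- Frame-to-token alignment $Z = (z_1, \dots, z_T)$, typically found via Viterbi decoding or sampling.
- Token-level embedding extraction: $E = \mathrm{TAE}(H^{(1)}, Z) = (e_1, \dots, e_U)$, with $e_u$ formed by pooling (or attending) over the frames $h^{(1)}_t$ that $Z$ assigns to token $u$.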
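As a rough illustration of the alignment-to-embedding step, the sketch below collapses a frame-level CTC alignment into token spans and mean-pools the first-pass hidden vectors inside each span. Mean pooling is a deliberate simplification swapped in for the self-attention-based TAE-E described above. The blank index `BLANK = 0` and the helper names are assumptions made for this example.

```python
from typing import List, Tuple

BLANK = 0  # assumed index of the CTC blank symbol


def alignment_to_spans(alignment: List[int]) -> List[Tuple[int, int, int]]:
    """Collapse a frame-level alignment into (token, start, end) spans.

    Consecutive repeats merge into one token; blanks separate tokens and are
    dropped. `end` is exclusive.
    """
    spans: List[Tuple[int, int, int]] = []
    prev = BLANK
    for t, label in enumerate(alignment):
        if label != BLANK and label == prev:
            tok, start, _ = spans[-1]
            spans[-1] = (tok, start, t + 1)   # extend the current token
        elif label != BLANK:
            spans.append((label, t, t + 1))   # a new token starts here
        prev = label
    return spans


def mean_pool(hidden: List[List[float]],
              spans: List[Tuple[int, int, int]]) -> List[List[float]]:
    """One embedding per token: the average of the frames in its span."""
    dim = len(hidden[0])
    pooled = []
    for _, start, end in spans:
        segment = hidden[start:end]
        pooled.append([sum(v[i] for v in segment) / len(segment)
                       for i in range(dim)])
    return pooled


alignment = [0, 3, 3, 0, 3, 5, 5, 5]
hidden = [[float(t)] for t in range(len(alignment))]  # toy 1-dim features
spans = alignment_to_spans(alignment)
print(spans)                     # prints: [(3, 1, 3), (3, 4, 5), (5, 5, 8)]
print(mean_pool(hidden, spans))  # prints: [[1.5], [4.0], [6.0]]
```

Note that the blank at frame 3 is what keeps the two occurrences of token 3 apart, which is the standard CTC collapsing rule.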
Second Pass (Joint Frame/Token Context)
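- $\tilde{X} = [X; E]$ (concatenation along the time axis, length $T + U$)
- $H^{(2)} = \mathrm{Enc}(\tilde{X})$
- Frame outputs: $H^{(2)}_{1:T}$ (auxiliary CTC loss)
- Token outputs: $H^{(2)}_{T+1:T+U}$ (final classification)
- Decoder-side distribution: $P_{\mathrm{dec}}(y_u \mid X, E) = \mathrm{softmax}(W_{\mathrm{dec}} h^{(2)}_{T+u})$
- Auxiliary second-pass CTC: $P^{(2)}_{\mathrm{ctc}}(z_t \mid X, E) = \mathrm{softmax}(W_{\mathrm{ctc}} h^{(2)}_t)$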
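The two output heads of the second pass can be sketched as a split followed by a projection and a softmax. The names `second_out`, `w_ctc`, and `w_dec` are hypothetical, the toy weights are arbitrary, and reusing one CTC matrix for the frame part is an assumption of this sketch rather than something the summary states.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def project(weight: Matrix, vec: List[float]) -> List[float]:
    """Matrix-vector product; `weight` has shape (vocab, dim)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def second_pass_heads(second_out: Matrix, num_frames: int,
                      w_ctc: Matrix, w_dec: Matrix) -> Tuple[Matrix, Matrix]:
    """Split second-pass output and apply the CTC and decoder projections."""
    frame_part = second_out[:num_frames]   # first T vectors: auxiliary CTC
    token_part = second_out[num_frames:]   # last U vectors: token outputs
    ctc_post = [softmax(project(w_ctc, h)) for h in frame_part]
    dec_post = [softmax(project(w_dec, h)) for h in token_part]
    return ctc_post, dec_post


# Toy example: T = 2 frames, U = 1 token, hidden size 2, vocabulary size 3.
second_out = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
w_ctc = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w_dec = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
ctc_post, dec_post = second_pass_heads(second_out, 2, w_ctc, w_dec)
best_token = max(range(3), key=dec_post[0].__getitem__)
print(len(ctc_post), len(dec_post), best_token)  # prints: 2 1 1
```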
3. Training Objectives
The objective unifies multiple losses:
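- First-pass CTC loss:
$\mathcal{L}^{(1)}_{\mathrm{ctc}} = -\log \sum_{Z \in \mathcal{B}^{-1}(Y)} \prod_{t=1}^{T} P^{(1)}_{\mathrm{ctc}}(z_t \mid X)$
- Second-pass CTC loss:
$\mathcal{L}^{(2)}_{\mathrm{ctc}} = -\log \sum_{Z \in \mathcal{B}^{-1}(Y)} \prod_{t=1}^{T} P^{(2)}_{\mathrm{ctc}}(z_t \mid X, E)$
- Decoder-side cross-entropy:
$\mathcal{L}_{\mathrm{dec}} = -\sum_{u=1}^{U} \log P_{\mathrm{dec}}(y_u \mid X, E)$
Here $\mathcal{B}^{-1}(Y)$ is the set of frame-level alignments that collapse to $Y$, $P^{(1)}_{\mathrm{ctc}}$ and $P^{(2)}_{\mathrm{ctc}}$ are the per-frame CTC posteriors of the first and second pass, and $P_{\mathrm{dec}}$ is the token-level output distribution. The full training objective is the weighted sum $\mathcal{L} = \lambda_1 \mathcal{L}^{(1)}_{\mathrm{ctc}} + \lambda_2 \mathcal{L}^{(2)}_{\mathrm{ctc}} + \lambda_3 \mathcal{L}_{\mathrm{dec}}$, with weighting hyperparameters $\lambda_1, \lambda_2, \lambda_3$ chosen empirically to minimize WER.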
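To make the loss terms concrete, here is a minimal pure-Python sketch of the standard CTC forward algorithm, a token-level cross-entropy, and their weighted sum. The weights `w1`, `w2`, `w3` default to 1.0 purely as placeholders (the values used in the paper are not given here), and `BLANK = 0` is an assumed blank index.

```python
import math
from typing import List

BLANK = 0  # assumed index of the CTC blank symbol


def logsumexp(values: List[float]) -> float:
    peak = max(values)
    if peak == -math.inf:
        return -math.inf
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def ctc_nll(log_probs: List[List[float]], target: List[int]) -> float:
    """-log P(target) summed over all alignments (CTC forward algorithm)."""
    ext = [BLANK]                      # target interleaved with blanks
    for tok in target:
        ext += [tok, BLANK]
    size = len(ext)
    alpha = [-math.inf] * size
    alpha[0] = log_probs[0][BLANK]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, len(log_probs)):
        new = [-math.inf] * size
        for s in range(size):
            cands = [alpha[s]]                       # stay on same symbol
            if s >= 1:
                cands.append(alpha[s - 1])           # advance by one
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])           # skip over a blank
            new[s] = logsumexp(cands) + log_probs[t][ext[s]]
        alpha = new
    tail = [alpha[-1]] + ([alpha[-2]] if size > 1 else [])
    return -logsumexp(tail)


def cross_entropy(log_probs: List[List[float]], target: List[int]) -> float:
    """Sum of per-token negative log-probabilities."""
    return -sum(lp[y] for lp, y in zip(log_probs, target))


def total_loss(l_ctc1: float, l_ctc2: float, l_dec: float,
               w1: float = 1.0, w2: float = 1.0, w3: float = 1.0) -> float:
    """Weighted sum of the three losses; the weights are placeholders."""
    return w1 * l_ctc1 + w2 * l_ctc2 + w3 * l_dec


# Two frames, uniform over {blank, token 1}; target is the single token 1.
uniform = [[math.log(0.5)] * 2 for _ in range(2)]
nll = ctc_nll(uniform, [1])
print(round(math.exp(-nll), 4))  # prints: 0.75
```

In the toy case, three of the four equally likely alignments (1 1, 1 blank, blank 1) collapse to the target, which is why the total probability is 0.75.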
4. Inference and Decoding Workflow
UniEnc-CASSNAT achieves non-autoregressive, single-step generation of all $U$ output tokens while capturing token dependencies through an iterative refinement procedure called error-based sampled alignments (ESA):
- Iteration 0:
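Run the first pass to obtain the per-frame CTC posteriors and the greedy CTC alignment. Identify low-confidence frames (those whose top posterior falls below a confidence threshold). For each of $S_1$ samples, resample the labels at the low-confidence frames, then extract the resulting alternative alignment and its TAEs.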
- Iteration 1:
For each sampled alignment from iteration 0, run the second pass on the concatenated features, compute the second-pass CTC posteriors, and output token predictions. Recompute the low-confidence frame sets and sample $S_2$ additional alignments per first-pass sample; extract new TAEs; optionally iterate further.
- Candidate Ranking:
Score each candidate hypothesis $\hat{Y}$ by summing log-probabilities from the decoder-side output distribution and/or the second-pass CTC posteriors; return the top-ranked sequence.
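The sampling and ranking steps above can be sketched as follows. This is a simplified illustration under stated assumptions: low-confidence frames are those whose top posterior is below a hypothetical `threshold`, resampling draws from the full per-frame posterior, and candidates are ranked by the sum of their token log-probabilities. The exact sampling rule and threshold of the paper are not reproduced here.

```python
import random
from typing import List, Tuple


def greedy_alignment(posteriors: List[List[float]]) -> List[int]:
    """Most likely label at every frame."""
    return [max(range(len(p)), key=p.__getitem__) for p in posteriors]


def low_confidence_frames(posteriors: List[List[float]],
                          threshold: float) -> List[int]:
    """Frames whose top posterior is strictly below the threshold."""
    return [t for t, p in enumerate(posteriors) if max(p) < threshold]


def sample_alignments(posteriors: List[List[float]], threshold: float,
                      num_samples: int, rng: random.Random) -> List[List[int]]:
    """Keep confident frames at their greedy label; resample the rest."""
    base = greedy_alignment(posteriors)
    uncertain = low_confidence_frames(posteriors, threshold)
    samples = []
    for _ in range(num_samples):
        alignment = list(base)
        for t in uncertain:
            labels = range(len(posteriors[t]))
            alignment[t] = rng.choices(labels, weights=posteriors[t])[0]
        samples.append(alignment)
    return samples


def rank(candidates: List[Tuple[List[int], List[float]]]) -> List[int]:
    """Return the tokens of the candidate with the highest total log-prob."""
    return max(candidates, key=lambda cand: sum(cand[1]))[0]


posteriors = [[0.95, 0.05], [0.5, 0.5], [0.1, 0.9]]
samples = sample_alignments(posteriors, threshold=0.9, num_samples=4,
                            rng=random.Random(0))
print(low_confidence_frames(posteriors, 0.9))  # prints: [1]

# Hypothetical (tokens, per-token log-probabilities) from the second pass.
candidates = [([1, 2], [-0.1, -0.2]), ([1, 3], [-0.05, -0.5])]
print(rank(candidates))  # prints: [1, 2]
```

Only frame 1 is resampled in the toy example, so every sampled alignment agrees with the greedy one at frames 0 and 2; each alignment would then yield its own TAEs and its own second-pass hypothesis to be ranked.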
Despite a single non-autoregressive step, token-token and token-frame interactions arise from self-attention in pass two and iterative TAE extraction.
5. Empirical Results
UniEnc-CASSNAT was benchmarked on Librispeech (100h, 1024 word-pieces), MyST (240h, 500 word-pieces), and Aishell1 (170h, 4230 chars), in every case fine-tuning a HuBERT-base encoder (12 Transformer layers) and without an external language model. Model sizes, word error rates (WER), and real-time factors (RTF) are summarized as follows:
| Model | Params (M) | Librispeech WER (dev-clean/other) | MyST WER (dev/test) | Aishell1 (dev/test) | RTF |
|---|---|---|---|---|---|
| AT-w/o SSL | 85.1 | 6.6 / 18.2 | — | — | 0.325 |
| AT-w/ SSL | 121.6 | 4.8 / 11.0 | 11.4/13.1 | 4.0/4.3 | 0.486 |
| CTC | 95.7 | 6.1 / 13.8 | 12.9/14.5 | 4.5/4.9 | 0.005 |
| CASS-NAT | 130.5 | 4.7 / 11.4 | 11.9/13.5 | 4.0/4.3 | 0.014 |
| UniEnc-CASSNAT | 99.3 | 4.9 / 11.0 | 11.8/13.5 | 4.2/4.5 | 0.093 |
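UniEnc-CASSNAT is broadly comparable to CASS-NAT (ahead on Librispeech dev-other and MyST dev, slightly behind on Librispeech dev-clean and Aishell1) while using about 24% fewer parameters by the table's counts (99.3M vs. 130.5M). It approaches the accuracy of autoregressive SSL-initialized ASR systems while running 3–5× faster than the autoregressive baselines (RTF 0.093 vs. 0.325 and 0.486).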
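Taking the table's figures at face value, the parameter and speed ratios can be checked with a few lines of arithmetic. The dictionaries below simply restate the table; nothing else is assumed.

```python
# Parameter counts (millions) and real-time factors copied from the table.
params = {"AT-w/o SSL": 85.1, "AT-w/ SSL": 121.6, "CTC": 95.7,
          "CASS-NAT": 130.5, "UniEnc-CASSNAT": 99.3}
rtf = {"AT-w/o SSL": 0.325, "AT-w/ SSL": 0.486, "CTC": 0.005,
       "CASS-NAT": 0.014, "UniEnc-CASSNAT": 0.093}

# Relative parameter saving of UniEnc-CASSNAT versus CASS-NAT.
reduction = 1 - params["UniEnc-CASSNAT"] / params["CASS-NAT"]

# Speedup is the ratio of real-time factors (lower RTF means faster).
speedup_vs_at_ssl = rtf["AT-w/ SSL"] / rtf["UniEnc-CASSNAT"]
speedup_vs_at_plain = rtf["AT-w/o SSL"] / rtf["UniEnc-CASSNAT"]

print(f"parameter reduction vs CASS-NAT: {reduction:.1%}")  # prints 23.9%
print(f"speedup vs AT-w/ SSL: {speedup_vs_at_ssl:.1f}x")    # prints 5.2x
print(f"speedup vs AT-w/o SSL: {speedup_vs_at_plain:.1f}x") # prints 3.5x
```

By the same table, UniEnc-CASSNAT's RTF is higher than that of CTC and CASS-NAT, so the speed advantage holds against the autoregressive baselines only.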
6. Design Ablations and Analysis
The ablations yield three main findings:
- Multi-pass CTC (MP-CTC):
SP-CTC (only the first-pass CTC loss) underperforms relative to CASS-NAT. Introducing the second-pass CTC loss without sampling does not close the gap. MP-CTC plus two-iteration ESA decoding surpasses CASS-NAT (WER reaches 4.9/11.0 on dev, 4.8/11.0 on test).
- TAE-E module size:
Evaluating several feed-forward dimensions in TAE-E shows only a marginal WER improvement from the largest setting, at a substantial parameter increase (+5M); a more compact setting represents the best trade-off.
- Sampling Strategy:
Splitting a fixed sample budget (e.g., 50) across two ESA refinement rounds (25×2) improves performance beyond single-round (50) sampling, supporting the benefit of iterative TAE extraction.
7. Significance and Implications
UniEnc-CASSNAT establishes that a single Transformer encoder, repurposed across encoding and decoding via two forward passes, can learn token interdependencies comparably to encoder-decoder architectures but with considerable parameter and inference efficiency benefits. The multi-pass CTC loss, coupled with ESA-based iterative decoding, closes much of the typical accuracy gap to autoregressive SSL-initialized models while preserving substantial speed advantages. This approach demonstrates a scalable strategy for integrating speech foundation models within non-autoregressive ASR frameworks (Fan et al., 2024).