UniEnc-CASSNAT: Efficient Non-Autoregressive ASR

Updated 2 July 2026
  • The paper demonstrates that a single encoder, reused in two forward passes, effectively models both acoustic and token-level contexts with a 30% parameter reduction.
  • It integrates CTC and CASS-NAT paradigms with iterative error-based sampled alignment to enhance non-autoregressive decoding performance.
  • Empirical results show that UniEnc-CASSNAT achieves competitive word error rates across benchmarks while delivering 3–5× faster inference than autoregressive models.

UniEnc-CASSNAT is an encoder-only non-autoregressive automatic speech recognition (ASR) framework designed to efficiently leverage speech self-supervised learning (SSL) models as foundational components. It integrates and extends the connectionist temporal classification (CTC) and CTC alignment-based single-step non-autoregressive transformer (CASS-NAT) paradigms, enabling efficient, dependency-aware sequence modeling with reduced parameter count and fast inference (Fan et al., 2024).

1. Architectural Overview

UniEnc-CASSNAT comprises three principal modules:

  • A front-end convolutional encoder mapping raw acoustic signals to a frame-level hidden sequence $H^0 \in \mathbb{R}^{T \times d}$, as in HuBERT’s configuration.
  • A contextual encoder $f_{\mathrm{ctx}}(\cdot)$ consisting of $L$ Transformer layers (e.g., $L=12$), optionally initialized from a speech SSL model such as HuBERT.
  • A token-level acoustic extractor ("TAE-E") employing a compact self-attention mechanism to pool $U$ token-level embeddings $E \in \mathbb{R}^{U \times d}$ based on frame-token alignments.

UniEnc-CASSNAT discards the classic sequence decoder. Instead, the contextual encoder $f_{\mathrm{ctx}}$ is reused in two consecutive forward passes, with token-level embedding extraction in between:

  1. The first pass processes acoustic features alone: $O^0 = f_{\mathrm{ctx}}(H^0) \in \mathbb{R}^{T \times d}$.
  2. Token-level embeddings $E$ are extracted from $O^0$ using TAE-E, then concatenated to the frame sequence: $H^1 = \mathrm{concat}(H^0, E) \in \mathbb{R}^{(T+U) \times d}$.
  3. The second pass computes $O^1 = f_{\mathrm{ctx}}(H^1) \in \mathbb{R}^{(T+U) \times d}$. The final $U$ vectors correspond to token-level “decoder” outputs. The initial $T$ vectors provide optional auxiliary CTC supervision.

This design enables a single encoder module to encode both acoustic and token contexts, reducing parameter count and facilitating transfer from SSL pretraining.
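The data flow of the two passes can be sketched in a few lines of plain Python. This is only an illustration of the shapes involved, not the paper's implementation: `toy_encoder` is a hypothetical stand-in for $f_{\mathrm{ctx}}$ (a real model would use Transformer layers), mean pooling stands in for the attention-based TAE-E, and the `segments` alignment is made up.

```python
import random

def toy_encoder(seq):
    """Hypothetical stand-in for f_ctx: blends each vector with the sequence mean.

    It keeps the one property that matters here: any sequence length is
    accepted and one output vector is returned per input vector.
    """
    dim = len(seq[0])
    mean = [sum(v[k] for v in seq) / len(seq) for k in range(dim)]
    return [[0.5 * v[k] + 0.5 * mean[k] for k in range(dim)] for v in seq]

def mean_pool(frames):
    """Average a list of d-dimensional vectors (stand-in for TAE-E attention)."""
    dim = len(frames[0])
    return [sum(f[k] for f in frames) / len(frames) for k in range(dim)]

def two_pass_forward(h0, segments):
    """h0: T frame vectors; segments: U lists of frame indices, one per token."""
    o0 = toy_encoder(h0)                                        # pass 1: T x d
    e = [mean_pool([o0[t] for t in seg]) for seg in segments]   # U x d
    h1 = h0 + e                                                 # concat: (T+U) x d
    o1 = toy_encoder(h1)                                        # pass 2, same encoder
    num_frames = len(h0)
    return o1[:num_frames], o1[num_frames:]                     # frame / token outputs

random.seed(0)
T, d = 6, 4
h0 = [[random.random() for _ in range(d)] for _ in range(T)]
segments = [[0, 1], [2, 3, 4], [5]]   # made-up alignment with U = 3 tokens
frame_out, token_out = two_pass_forward(h0, segments)
print(len(frame_out), len(token_out), len(token_out[0]))  # prints: 6 3 4
```

The point of the sketch is that the same function handles both passes; only the input length changes from $T$ to $T+U$.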

2. Mathematical Formalism

Key notation:

  • $H^0$: acoustic feature sequence after convolutional encoding ($H^0 \in \mathbb{R}^{T \times d}$).
  • $Y = (y_1, \dots, y_U)$: ground-truth output token sequence.
  • $W_{\mathrm{ctc}}, W_{\mathrm{dec}}$: output-classification matrices for the CTC and "decoder" projections.

First Pass (Encoder-side)

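  • $H^0 = \mathrm{ConvEnc}(X)$, where $X$ is the raw acoustic input.
  • $O^0 = f_{\mathrm{ctx}}(H^0)$
  • Per-frame CTC distribution: $P_{\mathrm{ctc}}(z_t \mid X) = \mathrm{softmax}(W_{\mathrm{ctc}}\, O^0_t)$, where $W_{\mathrm{ctc}}$ is the CTC output projection.
  • Frame-to-token alignment $Z = (z_1, \dots, z_T)$, typically found via Viterbi decoding or sampling.
  • Token-level embedding extraction: $E = \mathrm{TAE\text{-}E}(O^0, Z)$, with $E_u$ formed by pooling (or attending) over the frames $O^0_t$ that $Z$ aligns to token $u$.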
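To make the alignment-to-embedding step concrete, the sketch below groups frames by the token they are aligned to and pools each group. It rests on assumptions: label `0` is taken to be the CTC blank, blank frames are simply left out of every token segment, and mean pooling replaces the self-attention that TAE-E actually uses. The function names are hypothetical.

```python
BLANK = 0  # assumed id of the CTC blank label

def alignment_to_segments(alignment, blank=BLANK):
    """Group frame indices by token, following the CTC collapse rule.

    Consecutive repeats of a label form one token; a blank between two
    identical labels separates them into two tokens.
    """
    tokens, segments = [], []
    prev = blank
    for t, label in enumerate(alignment):
        if label != blank:
            if label != prev:          # a new token starts at this frame
                tokens.append(label)
                segments.append([])
            segments[-1].append(t)
        prev = label
    return tokens, segments

def pool_token_embeddings(o0, segments):
    """Mean-pool first-pass outputs over each token's frames (U x d result)."""
    dim = len(o0[0])
    return [
        [sum(o0[t][k] for t in seg) / len(seg) for k in range(dim)]
        for seg in segments
    ]

# Toy first-pass outputs: frame t is the vector [t, 2t].
o0 = [[float(t), 2.0 * t] for t in range(8)]
alignment = [0, 3, 3, 0, 3, 5, 5, 0]

tokens, segments = alignment_to_segments(alignment)
E = pool_token_embeddings(o0, segments)
print(tokens)    # prints: [3, 3, 5]
print(segments)  # prints: [[1, 2], [4], [5, 6]]
print(E)         # prints: [[1.5, 3.0], [4.0, 8.0], [5.5, 11.0]]
```

The number of segments fixes $U$, which is why a different alignment can yield a different number of token embeddings.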

Second Pass (Joint Frame/Token Context)

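  • $H^1 = \mathrm{concat}(H^0, E) \in \mathbb{R}^{(T+U) \times d}$
  • $O^1 = f_{\mathrm{ctx}}(H^1)$
  • Frame outputs: $O^1_{1:T}$ (auxiliary CTC loss)
  • Token outputs: $O^1_{T+1:T+U}$ (final classification)
  • Decoder-side distribution: $P_{\mathrm{dec}}(y_u \mid X, Z) = \mathrm{softmax}(W_{\mathrm{dec}}\, O^1_{T+u})$
  • Auxiliary second-pass CTC: $P^{(2)}_{\mathrm{ctc}}(z_t \mid X, Z) = \mathrm{softmax}(W_{\mathrm{ctc}}\, O^1_t)$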

3. Training Objectives

The objective unifies multiple losses:

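  • First-pass CTC loss:

$$\mathcal{L}^{(1)}_{\mathrm{ctc}} = -\log \sum_{Z \in \mathcal{B}^{-1}(Y)} \prod_{t=1}^{T} P_{\mathrm{ctc}}(z_t \mid X)$$

where $\mathcal{B}^{-1}(Y)$ is the set of frame-level alignments that collapse to the token sequence $Y$.

  • Second-pass CTC loss:

$$\mathcal{L}^{(2)}_{\mathrm{ctc}} = -\log \sum_{Z' \in \mathcal{B}^{-1}(Y)} \prod_{t=1}^{T} P^{(2)}_{\mathrm{ctc}}(z'_t \mid X, Z)$$

  • Decoder-side cross-entropy:

$$\mathcal{L}_{\mathrm{dec}} = -\sum_{u=1}^{U} \log P_{\mathrm{dec}}(y_u \mid X, Z)$$

The full training objective is a weighted combination of these three losses, with the loss weights as hyperparameters selected empirically for the best WER.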
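As a rough illustration of how the three terms could be computed and combined, the sketch below implements the standard CTC forward recursion and a token-level cross-entropy in plain Python. Several things are assumptions rather than details from the paper: label `0` is the blank, the combination is a simple weighted sum, and the weights `w_ctc1` and `w_ctc2` are placeholders. Probabilities are multiplied directly, so this version would underflow on long utterances; practical code works in log space.

```python
import math

def ctc_neg_log_likelihood(probs, target, blank=0):
    """Standard CTC loss via the forward algorithm.

    probs[t][v] is the posterior of label v at frame t; target holds label
    ids without blanks and must be short enough to fit in len(probs) frames.
    """
    ext = [blank]
    for y in target:            # interleave blanks: b, y1, b, y2, b, ...
        ext += [y, blank]
    size = len(ext)
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping the blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new
    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood)

def cross_entropy(token_probs, target):
    """Sum of negative log-probabilities of the reference tokens."""
    return -sum(math.log(token_probs[u][y]) for u, y in enumerate(target))

def total_loss(l_dec, l_ctc1, l_ctc2, w_ctc1=1.0, w_ctc2=1.0):
    """Weighted sum; the weights are placeholders, not values from the paper."""
    return l_dec + w_ctc1 * l_ctc1 + w_ctc2 * l_ctc2

# Two frames, vocabulary {0: blank, 1: "a"}, reference transcript ["a"].
probs = [[0.4, 0.6], [0.3, 0.7]]
l_ctc1 = ctc_neg_log_likelihood(probs, [1])
l_ctc2 = ctc_neg_log_likelihood(probs, [1])   # second-pass posteriors reused for brevity
l_dec = cross_entropy([[0.1, 0.9]], [1])
loss = total_loss(l_dec, l_ctc1, l_ctc2)
```

In the toy example the three valid alignments are (a, a), (a, blank) and (blank, a), so the CTC likelihood is 0.42 + 0.18 + 0.28 = 0.88.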

4. Inference and Decoding Workflow

UniEnc-CASSNAT achieves non-autoregressive, single-step generation of $U$ tokens while capturing token dependencies through an iterative refinement procedure called error-based sampled alignments (ESA):

  • Iteration 0:

Run the first pass to obtain $O^0$ and the greedy CTC alignment. Identify low-confidence frames (those whose top CTC posterior falls below a confidence threshold). For each sample in the sampling budget, resample the labels of the low-confidence frames, then extract the resulting alternative alignment and its token-level acoustic embeddings (TAEs).

  • Iteration 1:

For each sampled alignment from iteration 0, run the second pass with the concatenated features, compute $O^1$, and output token predictions. Compute updated low-confidence frame sets and sample additional alignments for each first-pass sample; extract new TAEs; optionally iterate further.

  • Candidate Ranking:

Score each candidate hypothesis by summing its log-probabilities under the decoder-side and/or second-pass CTC distributions; return the top-ranked sequence.

Despite a single non-autoregressive step, token-token and token-frame interactions arise from self-attention in pass two and iterative TAE extraction.
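The sampling and ranking steps can be sketched as follows. This is one plausible reading of ESA, not the authors' code: the confidence threshold, the choice to resample each low-confidence frame from its full posterior, and the function names are all assumptions made for illustration.

```python
import math
import random

def esa_sample_alignments(posteriors, threshold, num_samples, rng):
    """Resample labels only at frames whose top posterior is below threshold."""
    greedy = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
    low_conf = [t for t, p in enumerate(posteriors) if max(p) < threshold]
    samples = []
    for _ in range(num_samples):
        alignment = list(greedy)                 # confident frames keep greedy labels
        for t in low_conf:
            labels = range(len(posteriors[t]))
            alignment[t] = rng.choices(labels, weights=posteriors[t])[0]
        samples.append(alignment)
    return greedy, low_conf, samples

def rank_candidates(candidates):
    """candidates: (tokens, per-token probabilities) pairs; best log-score wins."""
    def score(candidate):
        return sum(math.log(p) for p in candidate[1])
    return max(candidates, key=score)

# Four frames over a three-label vocabulary (label 0 assumed to be blank).
posteriors = [
    [0.95, 0.03, 0.02],
    [0.50, 0.45, 0.05],
    [0.10, 0.85, 0.05],
    [0.40, 0.20, 0.40],
]
rng = random.Random(0)
greedy, low_conf, samples = esa_sample_alignments(posteriors, 0.9, 5, rng)
print(greedy)    # prints: [0, 0, 1, 0]
print(low_conf)  # prints: [1, 2, 3]

# Hypothetical second-pass outputs for two sampled alignments.
best = rank_candidates([([1, 2], [0.9, 0.8]), ([1, 3], [0.6, 0.95])])
print(best[0])   # prints: [1, 2]
```

Each sampled alignment would then go through token-embedding extraction and the second encoder pass before ranking; those steps are omitted here.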

5. Empirical Results

UniEnc-CASSNAT was benchmarked on Librispeech (100h, 1024 word-pieces), MyST (240h, 500 word-pieces), and Aishell1 (170h, 4230 chars), in each case fine-tuning a HuBERT-base encoder (12 Transformer layers) without an external language model. Model sizes, word error rates (WER), and real-time factors (RTF) are summarized as follows:

| Model | Params (M) | Librispeech WER (dev-clean / dev-other) | MyST WER (dev / test) | Aishell1 (dev / test) | RTF |
| AT-w/o SSL | 85.1 | 6.6 / 18.2 | - | - | 0.325 |
| AT-w/ SSL | 121.6 | 4.8 / 11.0 | 11.4 / 13.1 | 4.0 / 4.3 | 0.486 |
| CTC | 95.7 | 6.1 / 13.8 | 12.9 / 14.5 | 4.5 / 4.9 | 0.005 |
| CASS-NAT | 130.5 | 4.7 / 11.4 | 11.9 / 13.5 | 4.0 / 4.3 | 0.014 |
| UniEnc-CASSNAT | 99.3 | 4.9 / 11.0 | 11.8 / 13.5 | 4.2 / 4.5 | 0.093 |

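UniEnc-CASSNAT stays within 0.2 absolute of CASS-NAT on every reported set and improves on Librispeech dev-other (11.0 vs. 11.4) and MyST dev (11.8 vs. 11.9), with a smaller model (99.3M vs. 130.5M parameters). It matches the SSL-initialized autoregressive system on Librispeech dev-other and trails it by at most 0.4 absolute elsewhere, while its RTF of 0.093 is roughly 3.5× to 5× lower than the autoregressive baselines (0.325 and 0.486).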
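The speed-up figure follows directly from the RTF column. The short check below uses the values transcribed from the table above; nothing else is assumed.

```python
# Real-time factors transcribed from the results table (lower is faster).
rtf = {
    "AT-w/o SSL": 0.325,
    "AT-w/ SSL": 0.486,
    "CTC": 0.005,
    "CASS-NAT": 0.014,
    "UniEnc-CASSNAT": 0.093,
}

# Speed-up of UniEnc-CASSNAT over each autoregressive (AT) baseline.
speedup = {
    name: rtf[name] / rtf["UniEnc-CASSNAT"]
    for name in ("AT-w/o SSL", "AT-w/ SSL")
}
for name, factor in speedup.items():
    print(f"{name}: {factor:.1f}x")
# prints:
# AT-w/o SSL: 3.5x
# AT-w/ SSL: 5.2x
```

By the same column, plain CTC and CASS-NAT remain faster than UniEnc-CASSNAT, so the speed advantage applies to the autoregressive baselines only.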

6. Design Ablations and Analysis

Ablations highlight three findings:

  • Multi-pass CTC (MP-CTC):

SP-CTC (single-pass CTC, using the first-pass loss only) underperforms relative to CASS-NAT. Introducing the second pass without sampling does not close the gap. MP-CTC plus two-iteration decoding surpasses CASS-NAT (WER reaches 4.9/11.0 on dev, 4.8/11.0 on test).

  • TAE-E module size:

Evaluating several feed-forward dimensions in TAE-E shows that enlarging the dimension yields only a marginal WER improvement at a substantial parameter increase (+5M), so the smaller setting represents the better trade-off.

  • Sampling Strategy:

Splitting a fixed sample budget (e.g., 50) across two ESA refinement rounds (25×2) improves performance beyond single-round (50) sampling, supporting the benefit of iterative TAE extraction.

7. Significance and Implications

UniEnc-CASSNAT establishes that a single Transformer encoder, repurposed across encoding and decoding via two forward passes, can learn token interdependencies comparably to encoder-decoder architectures but with considerable parameter and inference efficiency benefits. The multi-pass CTC loss, coupled with ESA-based iterative decoding, closes much of the typical accuracy gap to autoregressive SSL-initialized models while preserving substantial speed advantages. This approach demonstrates a scalable strategy for integrating speech foundation models within non-autoregressive ASR frameworks (Fan et al., 2024).
