UniEnc-CASSNAT: Efficient Non-Autoregressive ASR

Updated 2 July 2026

The paper demonstrates that a single encoder, reused in two forward passes, effectively models both acoustic and token-level contexts with a 30% parameter reduction.
It integrates CTC and CASS-NAT paradigms with iterative error-based sampled alignment to enhance non-autoregressive decoding performance.
Empirical results show that UniEnc-CASSNAT achieves competitive word error rates across benchmarks while delivering 3–5× faster inference than autoregressive models.

UniEnc-CASSNAT is an encoder-only non-autoregressive automatic speech recognition (ASR) framework designed to efficiently leverage speech self-supervised learning (SSL) models as foundational components. It integrates and extends the connectionist temporal classification (CTC) and CTC alignment-based single-step non-autoregressive transformer (CASS-NAT) paradigms, enabling efficient, dependency-aware sequence modeling with reduced parameter count and fast inference (Fan et al., 2024).

1. Architectural Overview

UniEnc-CASSNAT comprises three principal modules:

A front-end convolutional encoder mapping raw acoustic signals to a frame-level hidden sequence $H^0 \in \mathbb{R}^{T \times d}$, as in HuBERT’s configuration.
A contextual encoder $f_{\mathrm{ctx}}(\cdot)$ comprising $L$ Transformer layers (e.g., $L=12$), optionally initialized from a speech SSL model such as HuBERT.
A token-level acoustic extractor ("TAE-E") employing a compact self-attention mechanism to pool $U$ token-level embeddings $E \in \mathbb{R}^{U \times d}$ based on frame-token alignments.

UniEnc-CASSNAT discards the classic sequence decoder. Instead, the contextual encoder $f_{\mathrm{ctx}}$ is reused in two consecutive forward passes:

The first pass processes acoustic features alone: $O^0 = f_{\mathrm{ctx}}(H^0) \in \mathbb{R}^{T \times d}$.
Token-level embeddings are extracted from $O^0$ using TAE-E, then concatenated to the frame sequence: $H^1 = \mathrm{concat}(H^0, E) \in \mathbb{R}^{(T+U) \times d}$.
The second pass computes $O^1 = f_{\mathrm{ctx}}(H^1) \in \mathbb{R}^{(T+U) \times d}$. The final $U$ vectors correspond to token-level “decoder” outputs. The initial $T$ vectors provide optional auxiliary CTC supervision.

This design enables a single encoder module to encode both acoustic and token contexts, reducing parameter count and facilitating transfer from SSL pretraining.
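The two-pass reuse described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_context_encoder`, `two_pass`, and `pool_pairs` are hypothetical names, and the toy encoder only stands in for the Transformer stack so that the shape flow ($T$ rows in the first pass, $T+U$ rows in the second, then a split) is visible.

```python
from typing import Callable, List

Matrix = List[List[float]]  # one row per sequence position, d features per row


def toy_context_encoder(x: Matrix) -> Matrix:
    """Hypothetical stand-in for f_ctx: mixes each row with the sequence mean.

    A real contextual encoder would be a stack of Transformer layers. This toy
    keeps only the property that matters here: it accepts any sequence length
    and returns one output row per input row.
    """
    n, d = len(x), len(x[0])
    mean = [sum(row[j] for row in x) / n for j in range(d)]
    return [[0.5 * row[j] + 0.5 * mean[j] for j in range(d)] for row in x]


def two_pass(h0: Matrix,
             extract_tokens: Callable[[Matrix], Matrix],
             f_ctx: Callable[[Matrix], Matrix] = toy_context_encoder):
    """Run the same encoder twice: frames only, then frames plus token embeddings."""
    o0 = f_ctx(h0)                  # first pass: T x d
    e = extract_tokens(o0)          # token-level embeddings: U x d
    h1 = h0 + e                     # concatenate along the sequence axis: (T+U) x d
    o1 = f_ctx(h1)                  # second pass through the same function
    t = len(h0)
    return o1[:t], o1[t:]           # first T rows (frames), last U rows (tokens)


def pool_pairs(o: Matrix) -> Matrix:
    """Dummy extractor (assumption): average adjacent frame pairs into tokens."""
    return [[(a + b) / 2 for a, b in zip(o[i], o[i + 1])]
            for i in range(0, len(o), 2)]


# T = 4 frames, d = 2 features, U = 2 tokens from the dummy extractor.
h0 = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
frames, tokens = two_pass(h0, pool_pairs)
print(len(frames), len(tokens))
```

Because `f_ctx` is passed once and called twice, no second set of decoder weights exists, which is the source of the parameter saving the text describes.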

2. Mathematical Formalism

Key notation:

$H^0 = (h_1, \dots, h_T)$: acoustic feature sequence after convolutional encoding ($h_t \in \mathbb{R}^{d}$).
$Y = (y_1, \dots, y_U)$: ground-truth output token sequence.
$W_{\mathrm{ctc}}, W_{\mathrm{dec}}$: output-classification matrices for CTC and "decoder" projections.

First Pass (Encoder-side)

$H^0 = \mathrm{ConvEnc}(X) \in \mathbb{R}^{T \times d}$, where $X$ is the raw acoustic input
$O^0 = f_{\mathrm{ctx}}(H^0) \in \mathbb{R}^{T \times d}$
Per-frame CTC distribution: $P^{0}_{\mathrm{ctc}}(z_t \mid X) = \mathrm{softmax}(W_{\mathrm{ctc}}\, o^{0}_t)$, where $o^{0}_t$ is row $t$ of $O^0$
Frame-to-token alignment $Z = (z_1, \dots, z_T)$, typically found via Viterbi decoding or sampling.
Token-level embedding extraction: $E = \mathrm{TAE}(O^0, Z) \in \mathbb{R}^{U \times d}$, with $e_u$ formed by pooling (or attending) over the rows of $O^0$ that $Z$ assigns to token $u$.
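The token-level embedding extraction in the first pass can be illustrated with plain mean pooling. This is a simplified sketch, assuming that each run of identical non-blank labels in the alignment marks one token and that blank index 0 is used; the actual TAE-E module uses self-attention rather than averaging, and `token_segments` and `mean_pool_tokens` are hypothetical helper names.

```python
from itertools import groupby
from typing import List, Tuple

BLANK = 0  # assumed index of the CTC blank label


def token_segments(alignment: List[int]) -> List[Tuple[int, int, int]]:
    """Return (label, start, end) for each run of identical non-blank labels."""
    segments, pos = [], 0
    for label, run in groupby(alignment):
        length = len(list(run))
        if label != BLANK:
            segments.append((label, pos, pos + length))
        pos += length
    return segments


def mean_pool_tokens(o0: List[List[float]],
                     alignment: List[int]) -> List[List[float]]:
    """Average the first-pass frame outputs inside each token's segment."""
    d = len(o0[0])
    embeddings = []
    for _, start, end in token_segments(alignment):
        rows = o0[start:end]
        embeddings.append([sum(r[j] for r in rows) / len(rows) for j in range(d)])
    return embeddings


# Six frames, two tokens (labels 3 and 5) separated by blanks.
alignment = [0, 3, 3, 0, 5, 0]
o0 = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [5.0, 6.0], [0.0, 0.0]]
segments = token_segments(alignment)
E = mean_pool_tokens(o0, alignment)
print(segments, E)
```

The number of rows in `E` equals the number of tokens implied by the alignment, so a different sampled alignment yields a different $U$ and a different set of embeddings.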

Second Pass (Joint Frame/Token Context)

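$H^1 = \mathrm{concat}(H^0, E) \in \mathbb{R}^{(T+U) \times d}$
$O^1 = f_{\mathrm{ctx}}(H^1) \in \mathbb{R}^{(T+U) \times d}$
Frame outputs: $O^1_{1:T}$ (auxiliary CTC loss)
Token outputs: $O^1_{T+1:T+U}$ (final classification)
Decoder-side distribution: $P_{\mathrm{dec}}(y_u \mid X, Z) = \mathrm{softmax}(W_{\mathrm{dec}}\, o^{1}_{T+u})$, with $W_{\mathrm{dec}}$ the "decoder" output projection and $o^{1}_{i}$ row $i$ of $O^1$
Auxiliary second-pass CTC: $P^{1}_{\mathrm{ctc}}(z_t \mid X) = \mathrm{softmax}(W_{\mathrm{ctc}}\, o^{1}_t)$ for $t \le T$, with $W_{\mathrm{ctc}}$ the CTC output projection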
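The split of the second-pass output into frame and token parts, each followed by its own softmax projection, might look as follows. This is a sketch under assumptions: `second_pass_heads`, `project`, and the tiny weight matrices are illustrative only, and sharing one CTC projection across both passes is an assumption rather than something the text confirms.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over one vector of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def project(w: Matrix, o: Vector) -> Vector:
    """Matrix-vector product: w has one row per output class, o has d entries."""
    return [sum(a * b for a, b in zip(row, o)) for row in w]


def second_pass_heads(o1: Matrix, num_frames: int, w_ctc: Matrix, w_dec: Matrix):
    """Split O^1 into its first T rows and last U rows, then classify each part."""
    frame_out, token_out = o1[:num_frames], o1[num_frames:]
    p_ctc = [softmax(project(w_ctc, o)) for o in frame_out]   # T distributions
    p_dec = [softmax(project(w_dec, o)) for o in token_out]   # U distributions
    return p_ctc, p_dec


# T = 2 frames and U = 1 token, d = 2 features, vocabulary of 3 classes.
o1 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w_ctc = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w_dec = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
p_ctc, p_dec = second_pass_heads(o1, num_frames=2, w_ctc=w_ctc, w_dec=w_dec)
print(len(p_ctc), len(p_dec))
```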

3. Training Objectives

The objective unifies multiple losses:

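First-pass CTC loss:

$\mathcal{L}^{0}_{\mathrm{ctc}} = -\log \sum_{Z \in \mathcal{B}^{-1}(Y)} \prod_{t=1}^{T} P^{0}_{\mathrm{ctc}}(z_t \mid X)$

Second-pass CTC loss:

$\mathcal{L}^{1}_{\mathrm{ctc}} = -\log \sum_{Z \in \mathcal{B}^{-1}(Y)} \prod_{t=1}^{T} P^{1}_{\mathrm{ctc}}(z_t \mid X)$

Decoder-side cross-entropy:

$\mathcal{L}_{\mathrm{dec}} = -\sum_{u=1}^{U} \log P_{\mathrm{dec}}(y_u \mid X, Z)$

Here $\mathcal{B}^{-1}(Y)$ is the set of frame-level alignments that collapse to the target sequence $Y$, $P^{0}_{\mathrm{ctc}}$ and $P^{1}_{\mathrm{ctc}}$ are the per-frame CTC distributions of the first and second pass, and $P_{\mathrm{dec}}$ is the token-level distribution of the second pass. The full training objective is a weighted combination of the three losses, $\mathcal{L} = \lambda_{0}\,\mathcal{L}^{0}_{\mathrm{ctc}} + \lambda_{1}\,\mathcal{L}^{1}_{\mathrm{ctc}} + \lambda_{\mathrm{dec}}\,\mathcal{L}_{\mathrm{dec}}$, with the weights treated as hyperparameters and chosen empirically for the best WER.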
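As a worked illustration of how the CTC and cross-entropy terms combine, the sketch below computes a CTC negative log-likelihood with the standard forward (alpha) recursion and adds a token-level cross-entropy. It is a toy under stated assumptions: blank index 0, per-frame log-probabilities supplied as nested lists, and weights defaulting to 1.0 purely as placeholders, since the paper's tuned values are not given here. The function names are hypothetical.

```python
import math
from typing import List

BLANK = 0                    # assumed index of the CTC blank label
NEG_INF = float("-inf")


def logsumexp(*xs: float) -> float:
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_nll(log_probs: List[List[float]], target: List[int]) -> float:
    """-log P(target) summed over all alignments, via the CTC forward recursion."""
    ext = [BLANK]
    for y in target:
        ext += [y, BLANK]                      # blank-interleaved target, length 2U+1
    S, T = len(ext), len(log_probs)
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][BLANK]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, T):
        prev, alpha = alpha, [NEG_INF] * S
        for s in range(S):
            terms = [prev[s]]
            if s >= 1:
                terms.append(prev[s - 1])
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                terms.append(prev[s - 2])
            alpha[s] = logsumexp(*terms) + log_probs[t][ext[s]]
    total = logsumexp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
    return -total


def cross_entropy(log_probs: List[List[float]], target: List[int]) -> float:
    """Sum of per-token negative log-probabilities of the reference tokens."""
    return -sum(lp[y] for lp, y in zip(log_probs, target))


def total_loss(lp_ctc0, lp_ctc1, lp_dec, target,
               w_ctc0: float = 1.0, w_ctc1: float = 1.0, w_dec: float = 1.0) -> float:
    """Weighted sum of first-pass CTC, second-pass CTC, and decoder losses."""
    return (w_ctc0 * ctc_nll(lp_ctc0, target)
            + w_ctc1 * ctc_nll(lp_ctc1, target)
            + w_dec * cross_entropy(lp_dec, target))


# Two frames, vocabulary {blank, 1}, uniform posteriors. Three of the four
# length-2 label paths collapse to [1], so the CTC likelihood is 0.75.
uniform = [[math.log(0.5)] * 2 for _ in range(2)]
loss = ctc_nll(uniform, [1])
combined = total_loss(uniform, uniform, [[math.log(0.5)] * 2], [1])
print(loss, combined)
```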

4. Inference and Decoding Workflow

UniEnc-CASSNAT achieves non-autoregressive, single-step generation of $U$ tokens while capturing token dependencies through an iterative refinement procedure called error-based sampled alignments (ESA):

Iteration 0:

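Run the first pass to obtain the first-pass CTC posterior $P^{0}_{\mathrm{ctc}}$ and the greedy CTC alignment. Identify low-confidence frames (those whose top CTC posterior falls below a confidence threshold). For each of $N$ samples, resample the labels at the low-confidence frames, then extract the resulting alternative alignments and TAEs.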

Iteration 1:

For each sampled alignment from iteration 0, run the second pass on the concatenated features, compute the second-pass CTC posterior $P^{1}_{\mathrm{ctc}}$, and output token predictions. Recompute the low-confidence frame sets, sample $M$ additional alignments per first-pass sample, and extract new TAEs; further iterations are optional.

Candidate Ranking:

Score each candidate hypothesis $\hat{Y}$ by summing log-probabilities from the decoder-side output and/or the second-pass CTC output; return the top-ranked sequence.
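The sampling and ranking steps above can be sketched as follows. This is an illustrative toy, not the paper's procedure in detail: the threshold value, the choice to resample weak frames directly from the CTC posterior, and all function names are assumptions, and the per-token log-probabilities passed to `rank_candidates` would in practice come from the second pass.

```python
import random
from itertools import groupby
from typing import List, Sequence, Tuple

BLANK = 0  # assumed index of the CTC blank label


def collapse(alignment: Sequence[int]) -> List[int]:
    """CTC collapse: merge repeated labels, then drop blanks."""
    return [k for k, _ in groupby(alignment) if k != BLANK]


def greedy_alignment(probs: List[List[float]]) -> List[int]:
    """Most probable label at every frame."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]


def low_confidence_frames(probs: List[List[float]], threshold: float) -> List[int]:
    """Frames whose top posterior is below the (assumed) confidence threshold."""
    return [t for t, p in enumerate(probs) if max(p) < threshold]


def sample_alignments(probs: List[List[float]], threshold: float,
                      num_samples: int, rng: random.Random) -> List[List[int]]:
    """Keep confident frames at their greedy label and resample only the weak ones."""
    base = greedy_alignment(probs)
    weak = low_confidence_frames(probs, threshold)
    samples = []
    for _ in range(num_samples):
        z = list(base)
        for t in weak:
            z[t] = rng.choices(range(len(probs[t])), weights=probs[t])[0]
        samples.append(z)
    return samples


def rank_candidates(candidates: List[List[int]],
                    token_log_probs: List[List[float]]) -> Tuple[List[int], float]:
    """Return the candidate with the highest summed log-probability."""
    scores = [sum(lps) for lps in token_log_probs]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


# Four frames over {blank, 1, 2}; only frame 1 is below the 0.5 threshold.
probs = [[0.9, 0.05, 0.05], [0.4, 0.35, 0.25], [0.05, 0.9, 0.05], [0.1, 0.1, 0.8]]
rng = random.Random(0)
weak = low_confidence_frames(probs, threshold=0.5)
samples = sample_alignments(probs, threshold=0.5, num_samples=5, rng=rng)
hypotheses = [collapse(z) for z in samples]
best, score = rank_candidates([[1, 2], [2, 1, 2]], [[-0.1, -0.2], [-1.0, -0.5, -0.3]])
print(weak, hypotheses, best, score)
```

Restricting resampling to weak frames keeps the candidate set small and concentrated where the first pass is most likely wrong, which is the intuition behind error-based sampling.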

Despite a single non-autoregressive step, token-token and token-frame interactions arise from self-attention in pass two and iterative TAE extraction.

5. Empirical Results

UniEnc-CASSNAT was benchmarked on Librispeech (100h, 1024 word-pieces), MyST (240h, 500 word-pieces), and Aishell1 (170h, 4230 chars), always fine-tuning a HuBERT-base encoder (12 Transformer layers) and never using an external LLM. Model sizes, word error rates (WER), and real-time factors (RTF) are summarized as follows:

Model	Params (M)	Librispeech WER (dev-clean/other)	MyST WER (dev/test)	Aishell1 (dev/test)	RTF
AT-w/o SSL	85.1	6.6 / 18.2	—	—	0.325
AT-w/ SSL	121.6	4.8 / 11.0	11.4/13.1	4.0/4.3	0.486
CTC	95.7	6.1 / 13.8	12.9/14.5	4.5/4.9	0.005
CASS-NAT	130.5	4.7 / 11.4	11.9/13.5	4.0/4.3	0.014
UniEnc-CASSNAT	99.3	4.9 / 11.0	11.8/13.5	4.2/4.5	0.093

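UniEnc-CASSNAT performs comparably to CASS-NAT: it is better on Librispeech dev-other (11.0 vs. 11.4) and on MyST dev (11.8 vs. 11.9), and slightly worse on Librispeech dev-clean (4.9 vs. 4.7) and Aishell1. It does so with fewer parameters (99.3M vs. 130.5M, about 24% by these figures, against the roughly 30% quoted for the paper). It approaches the performance of autoregressive SSL-initialized ASR systems while offering 3–5× faster inference (RTF 0.093 vs. 0.325 and 0.486), although its RTF is higher than that of CASS-NAT (0.014) and CTC (0.005).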
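The headline ratios can be checked directly against the table. The short script below only redoes the arithmetic on the reported figures; the dictionary layout and variable names are illustrative. From these entries the parameter reduction relative to CASS-NAT works out to roughly 24%, and the speed-up over the two autoregressive baselines to roughly 3.5× and 5.2×.

```python
# Figures copied from the results table: parameters in millions, RTF as reported.
params_m = {"CASS-NAT": 130.5, "UniEnc-CASSNAT": 99.3}
rtf = {"AT-w/o SSL": 0.325, "AT-w/ SSL": 0.486, "UniEnc-CASSNAT": 0.093}

# Relative parameter saving of UniEnc-CASSNAT against CASS-NAT.
reduction = 1 - params_m["UniEnc-CASSNAT"] / params_m["CASS-NAT"]

# Speed-up factor = baseline RTF divided by UniEnc-CASSNAT RTF.
speedups = {
    name: value / rtf["UniEnc-CASSNAT"]
    for name, value in rtf.items()
    if name != "UniEnc-CASSNAT"
}

print(f"parameter reduction vs CASS-NAT: {reduction:.1%}")
for name, factor in speedups.items():
    print(f"speed-up vs {name}: {factor:.2f}x")
```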

6. Design Ablations and Analysis

Extensive ablations elucidate several critical findings:

Multi-pass CTC (MP-CTC):

SP-CTC (only the first-pass loss $\mathcal{L}^{0}_{\mathrm{ctc}}$) underperforms relative to CASS-NAT. Introducing the second-pass loss ($\mathcal{L}^{1}_{\mathrm{ctc}}$) without sampling does not close the gap. MP-CTC combined with two-iteration ESA decoding surpasses CASS-NAT (WER reaches 4.9/11.0 on dev-clean/dev-other and 4.8/11.0 on test-clean/test-other).

TAE-E module size:

Varying the feed-forward dimension of TAE-E shows only a marginal WER improvement from the larger setting, at a substantial parameter increase (+5M); the smaller setting represents the better trade-off.

Sampling Strategy:

Splitting a fixed sample budget (e.g., 50) across two ESA refinement rounds (25×2) improves performance beyond single-round (50) sampling, supporting the benefit of iterative TAE extraction.

7. Significance and Implications

UniEnc-CASSNAT establishes that a single Transformer encoder, repurposed across encoding and decoding via two forward passes, can learn token interdependencies comparably to encoder-decoder architectures but with considerable parameter and inference efficiency benefits. The multi-pass CTC loss, coupled with ESA-based iterative decoding, closes much of the typical accuracy gap to autoregressive SSL-initialized models while preserving substantial speed advantages. This approach demonstrates a scalable strategy for integrating speech foundation models within non-autoregressive ASR frameworks (Fan et al., 2024).

References

Fan et al. (2024). UniEnc-CASSNAT: An Encoder-only Non-autoregressive ASR for Speech SSL Models.
