
SSCFormer: Efficient Streaming ASR Architecture

Updated 21 April 2026
  • SSCFormer is a streaming ASR architecture that extends chunk-wise Conformer frameworks by introducing SSC-MHSA and a dual-branch Chunked Causal Convolution module.
  • It enables real-time inference and large-batch parallel training by operating block-by-block with efficient dynamic attention masking and localized convolution.
  • Empirical results on the AISHELL-1 benchmark show SSCFormer achieves lower CER and scalable performance while maintaining linear computational complexity.

SSCFormer is an architecture for streaming automatic speech recognition (ASR) that extends chunk-wise Conformer frameworks by introducing a Sequentially Sampled Chunk (SSC) multi-head self-attention (MHSA) scheme and a dual-branch Chunked Causal Convolution (C2Conv) module. The design targets efficient, accurate online inference while supporting large-batch parallel training and scaling favorably with input length. On the AISHELL-1 benchmark, SSCFormer matches or outperforms prior streaming and time-restricted self-attention models while keeping computational complexity linear in input length (Wang et al., 2022).

1. Architectural Overview

SSCFormer builds upon the established Conformer backbone, which consists of stacks of Conformer blocks—each combining MHSA and convolutional modules—followed by a hybrid CTC/attention decoder. In SSCFormer, all operations are adapted for the chunk-wise streaming setting:

  • The encoder is a stack of twelve layers, alternating between two block types:
    • Chunk–C2Conv blocks: employ chunk-wise multi-head self-attention and a chunked causal convolution.
    • SSC–C2Conv blocks: employ SSC-MHSA for long-range context and a C2Conv layer.
  • Computation is performed block-by-block as:

    $\begin{aligned} \hat{Z}^l & = 0.5\,\mathrm{MLP}(\mathrm{LN}(Z^{l-1})) + Z^{l-1},\\ \tilde{Z}^l & = \mathrm{Chunk\mbox{-}MHSA}\bigl(\mathrm{LN}(\hat{Z}^l)\bigr) + \hat{Z}^l,\\ \bar{Z}^l & = \mathrm{C2Conv}\bigl(\mathrm{LN}(\tilde{Z}^l)\bigr) + \tilde{Z}^l,\\ Z^l & = \mathrm{LN}\left(0.5\,\mathrm{MLP}(\mathrm{LN}(\bar{Z}^l)) + \bar{Z}^l\right) \end{aligned}$

    For alternate layers, SSC-MHSA replaces Chunk-MHSA.

This interleaving of local and long-range chunk-wise computations enables streaming deployment with linear complexity in input length and supports large batch sizes during training.
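The residual structure of one block can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the sublayers are passed in as callables on lists of floats, the names `conformer_block` and `layer_schedule` are hypothetical, and starting the alternation with a Chunk-MHSA layer is an assumption.

```python
def add(a, b):
    """Element-wise sum of two equally long lists (the residual connection)."""
    return [x + y for x, y in zip(a, b)]


def half(a):
    """Scale by 0.5, as in the two half-step MLP terms."""
    return [0.5 * x for x in a]


def conformer_block(z, ln, mlp_in, attn, conv, mlp_out):
    """Apply the four residual steps of one block to the sequence z.

    ln, mlp_in, attn, conv and mlp_out are placeholder callables; attn would be
    Chunk-MHSA or SSC-MHSA depending on the layer.
    """
    z_hat = add(half(mlp_in(ln(z))), z)              # first half-step MLP
    z_tilde = add(attn(ln(z_hat)), z_hat)            # chunk-wise or SSC attention
    z_bar = add(conv(ln(z_tilde)), z_tilde)          # C2Conv
    return ln(add(half(mlp_out(ln(z_bar))), z_bar))  # second half-step MLP, final LN


def layer_schedule(num_layers=12):
    """Alternate the attention type per layer (assumed to start with 'chunk')."""
    return ["chunk" if i % 2 == 0 else "ssc" for i in range(num_layers)]
```

With identity sublayers the residual branches simply accumulate, which makes the data flow easy to check by hand before real modules are plugged in.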

2. Sequentially Sampled Chunk (SSC) Scheme

The SSC-MHSA mechanism is the principal innovation that facilitates efficient long-range context interaction within the streaming, chunk-wise paradigm:

  • Regular chunk partitioning divides the input sequence $Z=[z_0, \ldots, z_{L-1}]$ into contiguous blocks of width $W$:

    $\text{RegularChunk}_i = \{ z_{iW}, \ldots, z_{iW + W-1} \}$

  • SSC partitioning forms cross-chunk groups by sampling with stride $M = \lceil L/W \rceil$:

    $\text{SSCGroup}_j = \{ z_j, z_{j+M}, z_{j+2M}, \ldots, z_{j+(W-1)M} \}$

This grouping allows each SSCGroup to aggregate information from distant temporal locations within a local attention window, avoiding quadratic complexity.
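The two partitions can be compared directly on frame indices. The sketch below is illustrative only: the function names are hypothetical, and indices that fall past the end of the sequence are simply dropped here, whereas the text describes padding utterances to a common length.

```python
import math


def regular_chunks(L, W):
    """Contiguous blocks of width W over indices 0..L-1."""
    return [list(range(start, min(start + W, L))) for start in range(0, L, W)]


def ssc_groups(L, W):
    """Strided groups: group j holds indices j, j+M, j+2M, ... with M = ceil(L / W)."""
    M = math.ceil(L / W)
    return [[j + k * M for k in range(W) if j + k * M < L] for j in range(M)]


if __name__ == "__main__":
    print(regular_chunks(8, 4))  # neighbouring frames share a chunk
    print(ssc_groups(8, 4))      # frames M apart share a group
```

Each index lands in exactly one regular chunk and exactly one SSC group, so alternating the two partitions across layers lets information travel between chunks without any attention window exceeding $W$ tokens.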

  • Efficient masking and batching: All utterances are padded to a common length, and dynamic attention masks restrict each token’s view to the current streaming boundary to guarantee causality. Sampling is implemented as a single gather operation.
  • Complexity: Each SSCGroup is attended with $O(W^2)$ complexity, and the full input with $O(LW)$, yielding

    $\Omega(\text{SSC-MHSA}) = 4LC^2 + 2WLC$

for hidden size $C$, contrasting favorably with the cost of global MHSA, whose attention term grows quadratically in $L$.
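A short calculation makes the scaling difference concrete. The SSC-MHSA count is the expression given above; the global count is an assumption, obtained by applying the same counting convention to a single window of width $L$ (that is, replacing $W$ with $L$). The function names and the example sizes are hypothetical.

```python
def ssc_mhsa_cost(L, W, C):
    """Operation count 4*L*C^2 + 2*W*L*C for windowed (chunk or SSC) attention."""
    return 4 * L * C * C + 2 * W * L * C


def global_mhsa_cost(L, C):
    """Assumed count for global attention: the same formula with W replaced by L."""
    return 4 * L * C * C + 2 * L * L * C


if __name__ == "__main__":
    C, W = 256, 16  # illustrative sizes, not taken from the paper
    for L in (500, 1000, 2000):
        ratio = global_mhsa_cost(L, C) / ssc_mhsa_cost(L, W, C)
        print(f"L={L}: global / SSC cost ratio = {ratio:.2f}")
```

Doubling $L$ doubles the windowed cost exactly, while the attention term of the global count quadruples.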

3. Chunked Causal Convolution (C2Conv)

C2Conv is a two-branch module enhancing local and semi-local context capture under streaming constraints, replacing the vanilla convolution in classic Conformer models:

  • Causal convolution: the kernel weights are masked so that only past context is used.
  • Chunked convolution: the same weights are applied unmasked within a chunk, allowing future context locally.
  • The outputs of the two branches are combined:

    [combination formula not recoverable from the source]

Empirically, the best character error rate (CER) on the AISHELL-1 benchmark was obtained with one particular pair of settings, whose values are not recoverable from the source (Wang et al., 2022).

C2Conv supports chunk-wise MHSA without incurring additional latency, since all operations are confined to local or chunk-local neighborhoods.
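The two branches can be sketched on a one-dimensional signal. Several details here are assumptions rather than the paper's definition: the kernel is centered, the causal branch keeps the current frame and may look back across chunk boundaries, the chunked branch zero-pads at chunk edges, and the branches are mixed as a convex combination with a hypothetical weight `alpha`.

```python
def causal_conv(x, w):
    """Centered kernel w with its future taps masked: uses frames <= t only."""
    half = len(w) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, wk in enumerate(w):
            offset = k - half
            src = t + offset
            if offset <= 0 and src >= 0:  # skip future taps and left padding
                acc += wk * x[src]
        out.append(acc)
    return out


def chunked_conv(x, w, W):
    """Same kernel, unmasked, but restricted to the chunk that contains t."""
    half = len(w) // 2
    out = []
    for t in range(len(x)):
        lo = (t // W) * W
        hi = min(lo + W, len(x))
        acc = 0.0
        for k, wk in enumerate(w):
            src = t + k - half
            if lo <= src < hi:  # taps outside the chunk are treated as zero
                acc += wk * x[src]
        out.append(acc)
    return out


def c2conv(x, w, W, alpha):
    """Assumed mixing rule: alpha * causal branch + (1 - alpha) * chunked branch."""
    a, b = causal_conv(x, w), chunked_conv(x, w, W)
    return [alpha * ai + (1.0 - alpha) * bi for ai, bi in zip(a, b)]
```

For example, `c2conv([1, 2, 3, 4], [1, 1, 1], W=2, alpha=0.5)` mixes a branch that never looks ahead with one that looks ahead only inside the current chunk, so the output at frame `t` never depends on frames beyond the end of its chunk.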

4. Computational Characteristics and Training Regime

SSCFormer is optimized for both training efficiency and real-time streaming inference:

  • Parallel training: All chunk-wise attention and convolutions are independent per chunk, facilitating large-batch training; experiments used batch 48 on two RTX 3090 GPUs (versus 36 for the quadratic U2 baseline) (Wang et al., 2022).
  • Streaming inference: The encoder processes input chunk by chunk. In the first chunk, SSC-MHSA reduces to Chunk-MHSA because there is no history. For subsequent chunks, at most one chunk's worth of past activations is cached to form SSCGroups.
  • Latency and scaling: Because the cost per chunk is fixed, total cost grows linearly with utterance length and the real-time factor (RTF) remains constant, whereas the RTF of quadratic self-attention approaches grows with input length.
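The caching behaviour described for streaming inference can be sketched as a generator. This is a simplified illustration under stated assumptions: the name `stream_chunks` is hypothetical, the "activations" are just the raw frames, and the encoder itself is left out; only the bookkeeping (no history for the first chunk, one cached chunk afterwards) is shown.

```python
def stream_chunks(frames, W):
    """Yield (mode, cached, chunk) triples for chunk-by-chunk processing.

    mode is 'chunk' for the first chunk (no history, so SSC-MHSA falls back to
    Chunk-MHSA) and 'ssc' afterwards; cached holds at most one previous chunk.
    """
    cached = None
    for start in range(0, len(frames), W):
        chunk = frames[start:start + W]
        mode = "chunk" if cached is None else "ssc"
        yield mode, cached, chunk
        cached = chunk  # older history is discarded


if __name__ == "__main__":
    for mode, cached, chunk in stream_chunks(list(range(5)), 2):
        print(mode, cached, chunk)
```

Because the state carried between steps never exceeds one chunk, the memory and compute per step stay fixed as the utterance grows.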

5. Empirical Evaluation and Benchmarks

SSCFormer has been empirically validated on the AISHELL-1 Mandarin ASR benchmark. Key results:

| Model / Configuration | CER (%) | Latency / Remarks |
| --- | --- | --- |
| Vanilla chunk Conformer (causal only) | 6.09 | [not recoverable] |
| Shifted-chunk Conformer [Wang2022SCE] | 5.77 | [not recoverable] |
| SSC only (SSC-MHSA + causal conv) | 5.58 | [not recoverable] |
| SSC + C2Conv (SSCFormer, best) | 5.33 | [not recoverable] |
| U2 [Zhang2020USA] | 5.45 | [not recoverable] |
| SChunk-Conformer [Wang2022SCE] | 5.77 | [not recoverable] |
| CUSIDE [An2022CUSIDEC] | 5.47 | [not recoverable] |

Latency entries (reported in ms) and configuration remarks could not be recovered from the source.

  • Best CER (end-to-end): 5.33% (SSC + C2Conv)
  • Best with bidecoder rescoring: 5.10%
  • Best with external LM: 4.78%
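The size of these gains is easier to judge as relative CER reductions. The snippet below only re-expresses the CER figures listed above; the function name is hypothetical and no new results are introduced.

```python
def relative_reduction(baseline_cer, new_cer):
    """Relative CER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_cer - new_cer) / baseline_cer


if __name__ == "__main__":
    sscformer = 5.33
    baselines = {
        "Vanilla chunk Conformer": 6.09,
        "SChunk-Conformer": 5.77,
        "CUSIDE": 5.47,
        "U2": 5.45,
    }
    for name, cer in baselines.items():
        print(f"{name}: {relative_reduction(cer, sscformer):.1f}% relative reduction")
```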

SSCFormer achieves lower CER than all other chunk-wise and time-restricted baselines listed. Its linear complexity allows a larger batch size (48, compared with 36 for U2) and a competitive total training time over 180 epochs; the reported training times in hours are not recoverable from the source.

SSCFormer’s distinguishing factors are:

  • Long-range context: Through SSC-MHSA, it overcomes the limitation of chunk-wise MHSA, which cannot attend beyond chunk boundaries.
  • Streaming safety: All computations are causal or confined to the current chunk, so no frames beyond the current chunk are used.
  • Parallelism: Fully parallel training due to independence of chunk-wise computations.
  • Competitive accuracy: Outperforms previous chunked self-attention and time-restricted streaming architectures without increasing computational burden (Wang et al., 2022).

A plausible implication is that the SSC partitioning and dual-branch C2Conv framework could generalize to other streaming sequence modeling tasks where the tradeoff between local and global context is critical.

7. Implementation Hyperparameters and Practical Considerations

Key hyperparameters, as per the reported best AISHELL-1 results:

  • Convolution kernel size: [value not recoverable from the source]
  • Convolution mixing parameter: [value not recoverable from the source]
  • Chunk size: [value not recoverable from the source]
  • Number of encoder layers: $12$ (6 Chunk–C2Conv, 6 SSC–C2Conv interleaved)
  • Batch size: $48$ (training)
  • Total parameter count: closely matches baseline Conformer architectures

Inference requires only a single chunk’s worth of state; the chunk-wise streaming paradigm makes the model suitable for low-latency ASR deployment scenarios.

SSCFormer advances the state of the art for streaming ASR models by combining linear-cost attention spanning long-range dependencies with efficient chunked convolution, achieving superior recognition accuracy and training throughput under realistic online constraints (Wang et al., 2022).

References (1)
