SSCFormer: Efficient Streaming ASR Architecture
- SSCFormer is a streaming ASR architecture that extends chunk-wise Conformer frameworks by introducing SSC-MHSA and a dual-branch Chunked Causal Convolution module.
- It enables real-time inference and large-batch parallel training by operating block-by-block with efficient dynamic attention masking and localized convolution.
- Empirical results on the AISHELL-1 benchmark show SSCFormer achieves lower CER and scalable performance while maintaining linear computational complexity.
SSCFormer is an architecture for streaming automatic speech recognition (ASR) that extends chunk-wise Conformer frameworks by introducing a Sequentially Sampled Chunk (SSC) multi-head self-attention (MHSA) scheme and a dual-branch Chunked Causal Convolution (C2Conv) module. The design targets efficient, accurate online inference while retaining support for large-batch parallel training and scaling linearly with input length. With this linear computational complexity, SSCFormer matches or outperforms prior streaming and time-restricted self-attention models on standard benchmarks (Wang et al., 2022).
1. Architectural Overview
SSCFormer builds upon the established Conformer backbone, which consists of stacks of Conformer blocks—each combining MHSA and convolutional modules—followed by a hybrid CTC/attention decoder. In SSCFormer, all operations are adapted for the chunk-wise streaming setting:
- The encoder is a stack of twelve layers, alternating between two block types:
  - Chunk–C2Conv blocks: employ chunk-wise multi-head self-attention and a chunked causal convolution.
  - SSC–C2Conv blocks: employ SSC-MHSA for long-range context and a C2Conv layer.
- Computation is performed block-by-block as:
$\begin{aligned} \hat{Z}^l & = 0.5\,\mathrm{MLP}(\mathrm{LN}(Z^{l-1})) + Z^{l-1},\\ \tilde{Z}^l & = \mathrm{Chunk\mbox{-}MHSA}\bigl(\mathrm{LN}(\hat{Z}^l)\bigr) + \hat{Z}^l,\\ \bar{Z}^l & = \mathrm{C2Conv}\bigl(\mathrm{LN}(\tilde{Z}^l)\bigr) + \tilde{Z}^l,\\ Z^l & = \mathrm{LN}\left(0.5\,\mathrm{MLP}(\mathrm{LN}(\bar{Z}^l)) + \bar{Z}^l\right) \end{aligned}$
For alternate layers, SSC-MHSA replaces Chunk-MHSA.
This interleaving of local and long-range chunk-wise computations enables streaming deployment with linear complexity in input length and supports large batch sizes during training.
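The layer layout and the four residual steps above can be sketched in a few lines of Python. This is a toy illustration, not the authors' implementation: the names `encoder_layout` and `block_forward` are hypothetical, the sub-modules are passed in as plain callables on a flat list of floats (real modules operate on time-by-feature tensors), and starting the stack with a Chunk–C2Conv block is an assumption.

```python
from typing import Callable, List

Vector = List[float]
Module = Callable[[Vector], Vector]


def encoder_layout(num_layers: int = 12) -> List[str]:
    """Alternate the two block types (assumption: a Chunk-C2Conv block comes first)."""
    return ["chunk_c2conv" if i % 2 == 0 else "ssc_c2conv" for i in range(num_layers)]


def block_forward(z: Vector, mlp_in: Module, attention: Module,
                  c2conv: Module, mlp_out: Module, layer_norm: Module) -> Vector:
    """Apply the four residual steps of one block in the order given in the text."""
    def add(a: Vector, b: Vector) -> Vector:
        return [x + y for x, y in zip(a, b)]

    def half(a: Vector) -> Vector:
        return [0.5 * x for x in a]

    z_hat = add(half(mlp_in(layer_norm(z))), z)          # half-step MLP
    z_tilde = add(attention(layer_norm(z_hat)), z_hat)   # Chunk-MHSA or SSC-MHSA
    z_bar = add(c2conv(layer_norm(z_tilde)), z_tilde)    # C2Conv
    return layer_norm(add(half(mlp_out(layer_norm(z_bar))), z_bar))  # half-step MLP + final LN
```

Passing a different `attention` callable per layer, chosen from `encoder_layout()`, is one way to express the interleaving of Chunk-MHSA and SSC-MHSA.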
2. Sequentially Sampled Chunk (SSC) Scheme
The SSC-MHSA mechanism is the principal innovation that facilitates efficient long-range context interaction within the streaming, chunk-wise paradigm:
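- Regular chunk partitioning divides the input sequence into contiguous, non-overlapping blocks of a fixed width $W$.
- SSC partitioning forms cross-chunk groups (SSCGroups) by sampling frames at a fixed stride, so that the members of one group come from different regular chunks.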
This grouping allows each SSCGroup to aggregate information from distant temporal locations within a local attention window, avoiding quadratic complexity.
- Efficient masking and batching: All utterances are padded to a common length, and dynamic attention masks restrict each token’s view to the current streaming boundary to guarantee causality. Sampling is implemented as a single gather operation.
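- Complexity: With sequence length $N$, hidden size $d$, and groups of $W$ frames, attending within one SSCGroup costs $O(W^2 d)$, and the full input, which contains $N/W$ groups, costs $O(NWd)$. This is linear in $N$ and contrasts favorably with $O(N^2 d)$ for global MHSA.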
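To make the two partitioning schemes concrete, the sketch below builds index groups for a short sequence. It is a guess at the mechanics rather than the paper's exact rule: the functions `regular_chunks`, `ssc_groups`, and `streaming_mask`, the choice of sampling inside a window of `stride` adjacent chunks, and the chunk-level causality test are all assumptions made for illustration.

```python
from typing import List


def regular_chunks(num_frames: int, width: int) -> List[List[int]]:
    """Split frame indices into contiguous chunks of `width` frames."""
    return [list(range(s, min(s + width, num_frames)))
            for s in range(0, num_frames, width)]


def ssc_groups(num_frames: int, width: int, stride: int = 2) -> List[List[int]]:
    """Assumed rule: inside each window of `stride` adjacent chunks,
    group g takes every `stride`-th frame starting at offset g."""
    span = width * stride
    groups = []
    for start in range(0, num_frames, span):
        end = min(start + span, num_frames)
        for offset in range(stride):
            members = list(range(start + offset, end, stride))
            if members:
                groups.append(members)
    return groups


def streaming_mask(groups: List[List[int]], width: int, num_frames: int) -> List[List[bool]]:
    """mask[i][j] is True when frame i may attend to frame j: both are in the
    same group and j does not lie in a later regular chunk than i."""
    mask = [[False] * num_frames for _ in range(num_frames)]
    for members in groups:
        for i in members:
            for j in members:
                mask[i][j] = (j // width) <= (i // width)
    return mask
```

With 8 frames, width 4, and stride 2, each group holds 4 frames drawn from both regular chunks, so attention stays local in size while reaching across the chunk boundary. In a tensor framework, the index lists would feed a single gather operation.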
3. Chunked Causal Convolution (C2Conv)
C2Conv is a two-branch module enhancing local and semi-local context capture under streaming constraints, replacing the vanilla convolution in classic Conformer models:
- Causal convolution: the kernel weights are masked so that only past context is used.
- Chunked convolution: the same weights are applied unmasked within a chunk, allowing future context locally.
- The outputs of the two branches are then combined, with a mixing parameter controlling their relative contribution.
Empirically, this combination, with a tuned kernel size and mixing parameter, gave the best character error rate (CER) on the AISHELL-1 benchmark (Wang et al., 2022).
C2Conv supports chunk-wise MHSA without incurring additional latency, since all operations are confined to local or chunk-local neighborhoods.
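A one-dimensional toy version of the two branches is sketched below. It assumes an odd-length centered kernel and combines the branches as a convex sum weighted by `alpha`; that combination rule, and the names `causal_conv`, `chunked_conv`, and `c2conv`, are illustrative assumptions and not taken from the paper.

```python
from typing import List, Sequence


def causal_conv(x: Sequence[float], weights: Sequence[float]) -> List[float]:
    """Centered kernel with future taps masked: only frames up to t contribute."""
    half = len(weights) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            src = t + k - half
            if 0 <= src <= t:  # drop taps that point at future frames
                acc += w * x[src]
        out.append(acc)
    return out


def chunked_conv(x: Sequence[float], weights: Sequence[float], chunk: int) -> List[float]:
    """Same kernel, unmasked, but taps may not leave the chunk containing t."""
    half = len(weights) // 2
    out = []
    for t in range(len(x)):
        lo = (t // chunk) * chunk
        hi = min(lo + chunk, len(x))
        acc = 0.0
        for k, w in enumerate(weights):
            src = t + k - half
            if lo <= src < hi:  # future frames allowed only inside the chunk
                acc += w * x[src]
        out.append(acc)
    return out


def c2conv(x: Sequence[float], weights: Sequence[float], chunk: int,
           alpha: float = 0.5) -> List[float]:
    """Hypothetical mixing rule: alpha * causal + (1 - alpha) * chunked."""
    causal = causal_conv(x, weights)
    chunked = chunked_conv(x, weights, chunk)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(causal, chunked)]
```

For example, `c2conv([1, 2, 3, 4], [1, 1, 1], chunk=2)` mixes a branch that looks back across the chunk boundary with one that looks ahead only within the current chunk, so neither branch needs frames beyond the chunk being processed.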
4. Computational Characteristics and Training Regime
SSCFormer is optimized for both training efficiency and real-time streaming inference:
- Parallel training: All chunk-wise attention and convolutions are independent per chunk, facilitating large-batch training; experiments used batch 48 on two RTX 3090 GPUs (versus 36 for the quadratic U2 baseline) (Wang et al., 2022).
- Streaming inference: The encoder processes input chunk by chunk. In the first chunk, SSC-MHSA reduces to Chunk-MHSA because there is no history. For subsequent chunks, the past activations of at most one chunk are cached to form the SSCGroups.
- Latency and scaling: Due to linear per-chunk cost, the real-time factor (RTF) remains constant regardless of utterance length, in stark contrast to quadratic self-attention approaches whose RTF grows with input length.
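The scaling claim can be checked by counting attention score computations. The sketch below is a back-of-the-envelope model, not a measurement: the function names are hypothetical, the hidden-size factor is omitted, and each group is assumed to hold `width` frames.

```python
def global_attention_scores(num_frames: int) -> int:
    """Every frame attends to every frame: quadratic in sequence length."""
    return num_frames * num_frames


def chunked_attention_scores(num_frames: int, width: int) -> int:
    """Frames attend only within groups of `width` frames: linear in length."""
    full, rest = divmod(num_frames, width)
    return full * width * width + rest * rest


for n in (1000, 2000, 4000):
    per_frame_global = global_attention_scores(n) / n
    per_frame_chunked = chunked_attention_scores(n, 10) / n
    print(n, per_frame_global, per_frame_chunked)
```

The per-frame cost of the chunked variant stays at `width` as the utterance grows, while the global variant's per-frame cost grows with the number of frames, which mirrors the constant versus growing RTF described above.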
5. Empirical Evaluation and Benchmarks
SSCFormer has been empirically validated on the AISHELL-1 Mandarin ASR benchmark. Key results:
| Model / Configuration | CER (%) | Latency Remarks |
|---|---|---|
| Vanilla chunk Conformer (causal only) | 6.09 | 5 |
| Shifted-chunk Conformer [Wang2022SCE] | 5.77 | 6, 7 ms |
| SSC only (SSC-MHSA + causal conv) | 5.58 | 8 |
| SSC + C2Conv (SSCFormer, best) | 5.33 | 9, 0 ms |
| U2 [Zhang2020USA] | 5.45 | 1 ms |
| SChunk-Conformer [Wang2022SCE] | 5.77 | 2 ms |
| CUSIDE [An2022CUSIDEC] | 5.47 | 3 ms |
- Best CER (end-to-end): 5.33% (SSC + C2Conv, 4)
- Best with bidecoder rescoring: 5.10%
- Best with external LM: 4.78%
SSCFormer achieves lower CER than all other chunk-wise and time-restricted baselines. Its linear complexity allows competitive batch sizes and total training time (5 h for 180 epochs with batch 48, compared to 6 h for U2 at batch 36).
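The size of these gains is easier to judge as relative CER reductions. The short calculation below uses only the CER values listed in the table; the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Relative CER reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_cer - new_cer) / baseline_cer


SSCFORMER_CER = 5.33
BASELINES = {
    "Vanilla chunk Conformer": 6.09,
    "SChunk-Conformer": 5.77,
    "CUSIDE": 5.47,
    "U2": 5.45,
}

for name, cer in BASELINES.items():
    print(f"{name}: {relative_reduction(cer, SSCFORMER_CER):.1f}% relative CER reduction")
```

By this measure the improvement is roughly 12.5% relative to the vanilla chunk Conformer and roughly 2.2% relative to U2.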
6. Comparison with Related Architectures
SSCFormer’s distinguishing factors are:
- Long-range context: Through SSC-MHSA, it overcomes the limitation of chunk-wise MHSA, which cannot attend beyond chunk boundaries.
- Streaming safety: All computations are causal or confined to the current chunk, ensuring that no frames beyond the current chunk are used.
- Parallelism: Fully parallel training due to independence of chunk-wise computations.
- Competitive accuracy: Outperforms previous chunked self-attention and time-restricted streaming architectures without increasing computational burden (Wang et al., 2022).
A plausible implication is that the SSC partitioning and dual-branch C2Conv framework could generalize to other streaming sequence modeling tasks where the tradeoff between local and global context is critical.
7. Implementation Hyperparameters and Practical Considerations
Key hyperparameters, as per the reported best AISHELL-1 results:
- Convolution kernel size: 7
- Convolution mixing parameter: 8
- Chunk size: 9
- Number of encoder layers: 12 (6 Chunk–C2Conv, 6 SSC–C2Conv interleaved)
- Batch size: 48 (training)
- Total parameter count: closely matches baseline Conformer architectures
Inference requires only a single chunk’s worth of state; the chunk-wise streaming paradigm makes the model suitable for low-latency ASR deployment scenarios.
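The single-chunk state requirement can be illustrated with a small generator that walks over an utterance while holding on to only the previous chunk. This is a minimal sketch under stated assumptions: `stream_chunks` is a hypothetical helper, and the cached items here are raw frames, whereas a real encoder would cache layer activations.

```python
from typing import Iterator, List, Optional, Sequence, Tuple


def stream_chunks(frames: Sequence[float], width: int
                  ) -> Iterator[Tuple[Optional[List[float]], List[float]]]:
    """Yield (cached_previous_chunk, current_chunk), keeping one chunk of state."""
    previous: Optional[List[float]] = None
    for start in range(0, len(frames), width):
        current = list(frames[start:start + width])
        yield previous, current  # previous is None for the first chunk
        previous = current       # older chunks are dropped


for cached, current in stream_chunks([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], width=2):
    mode = "chunk-only attention" if cached is None else "attention with one cached chunk"
    print(current, mode)
```

The first iteration has no cache, matching the case where SSC-MHSA falls back to Chunk-MHSA, and memory use stays bounded by two chunks no matter how long the utterance is.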
SSCFormer advances the state of the art for streaming ASR models by combining linear-cost attention spanning long-range dependencies with efficient chunked convolution, achieving superior recognition accuracy and training throughput under realistic online constraints (Wang et al., 2022).