TDS-ConvNets for Low-Latency Speech Recognition
- TDS-ConvNets are fully convolutional acoustic models using time–depth separable 1D convolutional residual blocks designed for online, low-latency speech recognition.
- They combine grouped temporal convolutions, pointwise convolutions, and Connectionist Temporal Classification (CTC) loss to deliver high throughput with competitive word error rates.
- Benchmark results show roughly three times the throughput and a substantially lower real-time factor than LC-BLSTM baselines at equal or better word error rates; restricting future context to ~250 ms costs only about 4% relative WER versus an unconstrained model.
Time-Depth Separable Convolutional Networks (TDS-ConvNets) are a class of fully convolutional architectures designed for high-throughput, low-latency end-to-end speech recognition. Developed for efficient online acoustic modeling, TDS-ConvNets employ “time–depth separable” 1D convolutional residual blocks together with the Connectionist Temporal Classification (CTC) loss and an optimized beam-search decoder. The approach yields significant improvements in inference speed and latency while maintaining competitive word error rates (WER) in both clean and noisy speech recognition scenarios (Pratap et al., 2020).
1. Mathematical Formulation of the TDS Block
A TDS block is defined as a residual 1D convolution that is both time- and depth-separable. The structural parameters are denoted TDS($g$, $w$, $k$, $r$), where $g$ is the number of channel-groups, $w$ is the width of each group (so the total channel count is $c = g \cdot w$), $k$ is the temporal kernel size, and $r$ is the right padding (the number of future frames used for context).
The block comprises the following stages:
- Layer normalization is applied along the channel and group-width axes, with statistics computed independently at each time-step:

  $$\hat{x}_t = \gamma \odot \frac{x_t - \mu_t}{\sqrt{\sigma_t^2 + \epsilon}} + \beta, \qquad \mu_t = \frac{1}{c} \sum_{j=1}^{c} x_{t,j}, \quad \sigma_t^2 = \frac{1}{c} \sum_{j=1}^{c} (x_{t,j} - \mu_t)^2.$$

- Grouped 1D convolution over time: the normalized input is split into $g$ contiguous groups $\hat{x}^{(1)}, \dots, \hat{x}^{(g)}$ of width $w$ each. Group $i$ undergoes temporal convolution with a group-specific kernel $K^{(i)} \in \mathbb{R}^{w \times w \times k}$ and asymmetric padding ($k - 1 - r$ past frames, $r$ future frames):

  $$z_t^{(i)} = \sum_{\tau = -(k-1-r)}^{r} K_\tau^{(i)} \, \hat{x}^{(i)}_{t+\tau}, \qquad i = 1, \dots, g.$$

  The outputs for all groups are concatenated across the channel dimension: $z_t = [z_t^{(1)}; \dots; z_t^{(g)}] \in \mathbb{R}^{c}$.
- Pointwise (1×1) convolution: the concatenated output $z_t$ is transformed by a learnable matrix $W \in \mathbb{R}^{c \times c}$ and bias $b \in \mathbb{R}^{c}$:

  $$y_t = W z_t + b.$$

- Residual connection and ReLU:

  $$o_t = \mathrm{ReLU}(x_t + y_t).$$
Factorization into grouped and pointwise convolutions significantly reduces the parameter count relative to a dense temporal convolution: the per-block weight count drops from $c^2 k$ to $g w^2 k + c^2 = c^2 k / g + c^2$. Asymmetric padding with small $r$ restricts the required future context, trading a little accuracy for lower latency.
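To make the factorization concrete, the following is a minimal PyTorch sketch of a single TDS block under the notation above. It is an illustrative reimplementation, not the wav2letter++ code; class and argument names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TDSBlock(nn.Module):
    """Time-depth separable block TDS(g, w, k, r): grouped temporal conv
    followed by a pointwise (1x1) conv, with residual connection and ReLU."""

    def __init__(self, g: int, w: int, k: int, r: int):
        super().__init__()
        c = g * w                       # total channels
        self.k, self.r = k, r
        self.norm = nn.LayerNorm(c)     # per-time-step stats over channels
        # Grouped conv over time: each of the g groups has its own w x w x k kernel.
        self.grouped = nn.Conv1d(c, c, kernel_size=k, groups=g, padding=0)
        self.pointwise = nn.Conv1d(c, c, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, c, T)
        xn = self.norm(x.transpose(1, 2)).transpose(1, 2)
        # Asymmetric padding: k-1-r past frames on the left, r future frames on the right.
        xn = F.pad(xn, (self.k - 1 - self.r, self.r))
        z = self.grouped(xn)            # group outputs concatenated along channels
        y = self.pointwise(z)
        return F.relu(x + y)            # residual + ReLU
```

For instance, `TDSBlock(g=4, w=128, k=9, r=1)` maps a `(batch, 512, T)` input to the same shape while looking only one frame into the future (the hyperparameter values here are arbitrary).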
2. Model Architecture and Layerwise Specifications
The TDS-ConvNet acoustic encoder consists of a sequence of convolutional and TDS blocks, processing 80-dimensional log-Mel filterbank features sampled at 10 ms intervals. The architecture is as follows:
| Layer(s) | Specification | Output shape |
|---|---|---|
| 0 (input) | 80-dim log-Mel filterbank, 10 ms stride | $T \times 80$ |
| 1 | 1×1 conv (80→256), stride 2 | $T/2 \times 256$ |
| 2–3 | 2× TDS blocks, no subsampling | $T/2 \times 256$ |
| 4 | 1×1 conv (800→512), stride 2 | $T/4 \times 512$ |
| 5–7 | 3× TDS blocks, no subsampling | $T/4 \times 512$ |
| 8 | 1×1 conv (512→512), stride 2 | $T/8 \times 512$ |
| 9–12 | 4× TDS blocks, no subsampling | $T/8 \times 512$ |
| 13 | 1×1 conv (512→512), stride 1 | $T/8 \times 512$ |
| 14–18 | 5× TDS blocks, no subsampling | $T/8 \times 512$ |
| 19 (output) | Linear (512→$N$), log-softmax, CTC | $T/8 \times N$ |
The model contains approximately 104 million parameters, with an overall 8× subsampling factor from the three stride-2 convolutions. Each TDS block uses one frame of future context ($r = 1$), which accumulates to about 250 ms across all TDS layers.
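A quick sanity check of the table's frame-rate bookkeeping, in Python. The placement of the stride-2 convolutions at layers 1, 4, and 8 is an assumption consistent with the stated 8× subsampling; it is not recoverable from the source.

```python
# Stride per conv layer; TDS blocks are stride 1 ("no subsampling").
strides = {1: 2, 4: 2, 8: 2, 13: 1}

frame_ms = 10.0  # 10 ms hop of the input log-Mel features
for layer in range(1, 19):
    frame_ms *= strides.get(layer, 1)

print(frame_ms)  # 80.0 -> one output frame per 80 ms, i.e. 8x subsampling
```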
3. Temporal Dynamics, Receptive Field, and Latency
The effective receptive field is governed by the progression of convolutional and TDS layers:

$$R = 1 + \sum_{l=1}^{L} (k_l - 1) \prod_{j=1}^{l-1} s_j,$$

with $k_l$ and $s_l$ the kernel size and stride, respectively, of layer $l$. With the three stride-2 convolutions ($s_l = 2$) and the stride-1 TDS blocks, the full receptive field is approximately 10 seconds (i.e., 1000 frames at a 10 ms interval).
Future context (right padding) accumulates only across the TDS layers:

$$C_{\text{future}} = 10\ \text{ms} \cdot \sum_{l \in \mathrm{TDS}} r_l \prod_{j=1}^{l-1} s_j.$$
Constraining $r_l$ to small values (here, one frame per block) dramatically reduces total model-induced latency. Operational end-to-end latency is the sum of the acoustic model's future context (~250 ms), decoder delay, and audio chunking overhead.
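The two formulas above can be evaluated mechanically. A small helper sketch follows; the kernel sizes and padding values in the examples are toy numbers, since the actual per-layer values are not preserved in the source.

```python
def receptive_field(layers):
    """Receptive field (in input frames) of a stack of 1D conv layers.

    layers: list of (kernel_size, stride) pairs, in order.
    Implements R = 1 + sum_l (k_l - 1) * prod_{j<l} s_j.
    """
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

def future_context_ms(layers, frame_ms=10.0):
    """Accumulated future context of a stack of layers.

    layers: list of (right_padding, stride) pairs, in order. Each layer
    contributes r_l frames at its *input* frame period.
    """
    ctx, period = 0.0, frame_ms
    for r, s in layers:
        ctx += r * period
        period *= s
    return ctx

print(receptive_field([(9, 2), (9, 1), (9, 2)]))            # 41 frames
print(future_context_ms([(0, 2), (1, 1), (1, 1), (0, 2)]))  # 40.0 ms
```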
4. CTC Objective and Sequence Decoding
The encoder output is projected to $N$ vocabulary logits per frame, followed by a log-softmax. The model is trained with Connectionist Temporal Classification (CTC), with loss

$$\mathcal{L}_{\mathrm{CTC}} = -\log \sum_{\pi \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} p_t(\pi_t \mid x),$$

where $\mathcal{B}^{-1}(y)$ is the set of all length-$T$ alignments (with insertion of blank tokens) that collapse to the target sequence $y$. Standard forward–backward dynamic programming is used, without modification of the canonical CTC formulation.
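Because the canonical CTC formulation is used unmodified, training reduces to the stock CTC loss. A minimal PyTorch sketch with illustrative shapes (blank id 0, vocabulary size $N = 5000$, and target length 30 are assumptions, not values from the source):

```python
import torch
import torch.nn as nn

# Illustrative shapes: T output frames, batch B, vocabulary size N.
T, B, N = 200, 4, 5000
log_probs = torch.randn(T, B, N).log_softmax(dim=-1)  # encoder logits + log-softmax

targets = torch.randint(1, N, (B, 30))                # label ids; 0 reserved for blank
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 30, dtype=torch.long)

# Forward-backward dynamic programming happens inside nn.CTCLoss.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```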
Online decoding employs a prefix beam search over the CTC output, incorporating an $n$-gram language model via simple log-linear score interpolation.
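Concretely, log-linear interpolation ranks each beam hypothesis by a weighted sum of acoustic and language-model scores. A schematic scoring function follows; the names and default weights are illustrative, and the word-insertion term is a common companion in CTC decoders rather than something stated in the source.

```python
def hypothesis_score(log_p_ctc: float, log_p_lm: float, num_words: int,
                     alpha: float = 0.5, beta: float = 1.0) -> float:
    """Rank a prefix-beam-search hypothesis by log-linear interpolation:
    acoustic (CTC) score + alpha * n-gram LM score + beta * word count.
    alpha and beta are tuned on a development set."""
    return log_p_ctc + alpha * log_p_lm + beta * num_words
```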
5. Performance Metrics and Trade-Offs
TDS-ConvNet systems are benchmarked against strong low-latency baselines (LC-BLSTM + LF-MMI and LC-BLSTM + RNN-T), with the following results:
| Metric | LC-BLSTM + LF-MMI | LC-BLSTM + RNN-T | TDS conv + CTC |
|---|---|---|---|
| Parameters | 80 M | 60 M | 104 M |
| Inference precision | INT8 | INT8 | FP16 |
| WER (vid-clean) | 14.10% | 13.93% | 13.19% |
| WER (vid-noisy) | 22.15% | 22.58% | 21.16% |
| Throughput (sec audio/sec) | 55 | 64 | 147 |
| RTF@40 streams | 0.70 | 0.60 | 0.26 |
| User-perceived latency (40 streams) | 1.18 s | – | 1.09 s |
Ablation on future context vs WER demonstrates that reducing future context from 5 s to 0.25 s incurs only a ~4% relative WER degradation (vid-clean: 12.65% → 13.19%; vid-noisy: 20.44% → 21.16%) while reducing model latency by an order of magnitude. TDS-ConvNet achieves approximately three times the throughput of an optimized hybrid baseline, with significantly reduced real-time factor (RTF).
6. Decoder Optimizations and Practical Considerations
The optimized wav2letter++ CTC beam-search decoder is further enhanced by two additional pruning strategies (sketched below):
1. Acoustic pruning: during beam expansion, retain only the top-$K$ tokens by local acoustic score.
2. Blank pruning: if the blank posterior $p_t(\varnothing)$ at a frame exceeds a fixed threshold, only the blank arc is extended.
Combined with the encoder's 8× subsampling, decoding constitutes approximately 5% of total inference time.
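A minimal per-frame sketch of how the two pruning rules slot into prefix beam search. The function name, threshold value, and list-based representation are illustrative; the production wav2letter++ decoder is optimized C++.

```python
import math

BLANK = 0  # CTC blank token id (assumed)

def frame_candidates(log_probs_t, top_k, blank_log_threshold=math.log(0.999)):
    """Per-frame candidate selection for CTC prefix beam search.

    Blank-pruning: if the blank log-probability exceeds the threshold,
    extend only the blank arc. Acoustic-pruning: otherwise keep only the
    top-k tokens by local acoustic score.
    """
    if log_probs_t[BLANK] > blank_log_threshold:
        return [BLANK]
    ranked = sorted(range(len(log_probs_t)),
                    key=lambda tok: log_probs_t[tok], reverse=True)
    return ranked[:top_k]
```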
Training and inference notes:
- Training is performed with SpecAugment and local mean/variance normalization (window: 300 frames, ≈3 s) to enable online normalization (see the sketch after this list).
- Inference leverages FBGEMM for efficient mixed-precision (FP16) grouped convolutions.
- Chunk size (e.g., 750 ms) and the number of concurrent streams (e.g., 40–60) can be tuned to balance latency and throughput.
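The local normalization in the first note can be computed without waiting for future audio. A minimal sketch, assuming a trailing window (the source does not specify the window alignment):

```python
import numpy as np

def local_normalize(feats: np.ndarray, window: int = 300) -> np.ndarray:
    """Mean/variance-normalize each frame over the trailing `window` frames
    (300 frames ~ 3 s at a 10 ms hop), so statistics are available online.

    feats: (T, 80) array of log-Mel features.
    """
    out = np.empty_like(feats)
    for t in range(len(feats)):
        ctx = feats[max(0, t - window + 1): t + 1]   # past-only context
        mu = ctx.mean(axis=0)
        sigma = ctx.std(axis=0) + 1e-5               # avoid division by zero
        out[t] = (feats[t] - mu) / sigma
    return out
```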
The overall architecture—fully convolutional and devoid of recurrent connections—enables streaming-friendly operation, often with higher throughput and lower latency than traditional RNN-based ASR systems, while holding competitive or superior recognition accuracy (Pratap et al., 2020).