TDS-ConvNets for Low-Latency Speech Recognition

Updated 12 May 2026
  • TDS-ConvNets are fully convolutional acoustic models using time–depth separable 1D convolutional residual blocks designed for online, low-latency speech recognition.
  • They combine grouped temporal convolutions, pointwise convolutions, and Connectionist Temporal Classification (CTC) loss to deliver high throughput with competitive word error rates.
  • Benchmark results show roughly three times the throughput and significantly reduced real-time factor compared to LC-BLSTM baselines, with only a marginal increase in WER.

Time-Depth Separable Convolutional Networks (TDS-ConvNets) are a class of fully convolutional architectures specifically designed for high-throughput, low-latency end-to-end speech recognition. Developed with the goal of providing efficient online acoustic modeling, TDS-ConvNets employ “time–depth separable” 1D convolutional residual blocks, together with Connectionist Temporal Classification (CTC) loss and optimized beam search decoding. This approach results in significant improvements in inference speed and latency, while maintaining or surpassing competitive word error rates (WER) in both clean and noisy speech recognition scenarios (Pratap et al., 2020).

1. Mathematical Formulation of the TDS Block

A TDS block is defined as a residual 1D convolution that is both time- and depth-separable. The structural parameters are denoted TDS(c, k_w, w, r), where c is the number of channel groups, w is the width of each group (so the total channel count is C = c \cdot w), k_w is the temporal kernel size, and r is the right padding (the number of future frames used for context).

The block comprises the following stages:

  • Layer normalization is applied along the channel and group width axes, with statistics computed for each individual time-step:

\mu_t = \frac{1}{C} \sum_{i=1}^{C} X_{t,i}, \qquad \sigma_t^2 = \frac{1}{C} \sum_{i=1}^{C} (X_{t,i} - \mu_t)^2, \qquad \widetilde X_{t,i} = \gamma_i \, \frac{X_{t,i} - \mu_t}{\sqrt{\sigma_t^2 + \epsilon}} + \beta_i.

  • Grouped 1D convolution over time: The normalized input is split into c contiguous groups of w channels. Each group g undergoes a temporal convolution with a group-specific kernel K^{(g)} and asymmetric padding (k_w - 1 - r frames of past context on the left, r frames of future context on the right):

Y^{(g)}_{t,j} = \sum_{\tau=0}^{k_w - 1} \sum_{i=1}^{w} K^{(g)}_{j,i,\tau} \, \widetilde X^{(g)}_{t + \tau - (k_w - 1 - r),\, i}.

The outputs for all groups are concatenated across the channel dimension.

  • Pointwise (1×1) convolution: The concatenated output Y_t \in \mathbb{R}^{C} is transformed by a learnable matrix W \in \mathbb{R}^{C \times C} and bias b \in \mathbb{R}^{C}:

Z_t = W Y_t + b.

  • Residual connection and ReLU:

O_t = \mathrm{ReLU}(X_t + Z_t).

Factorization into grouped and pointwise convolutions significantly reduces the parameter count compared to a dense temporal convolution. Asymmetric padding with small r restricts the required future context, trading off accuracy for lower latency.
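The stages above can be sketched in plain NumPy. This is an illustrative reference implementation, not the paper's code; the tensor layout and kernel shapes are assumptions made for clarity:

```python
import numpy as np

def tds_block(X, K, W, b, gamma, beta, c, w, kw, r, eps=1e-5):
    """One time-depth separable (TDS) residual block.

    X           : (T, C) input sequence, C = c * w channels
    K           : (c, w, w, kw) group-specific temporal kernels
    W, b        : (C, C) matrix and (C,) bias of the pointwise (1x1) conv
    gamma, beta : (C,) layer-norm affine parameters
    r           : right padding = number of future frames of context
    """
    T, C = X.shape
    assert C == c * w

    # 1) Layer norm over the channel axis, per time step
    mu = X.mean(axis=1, keepdims=True)
    var = X.var(axis=1, keepdims=True)
    Xn = gamma * (X - mu) / np.sqrt(var + eps) + beta

    # 2) Grouped temporal convolution with asymmetric padding:
    #    kw - 1 - r past frames on the left, r future frames on the right
    Xp = np.pad(Xn, ((kw - 1 - r, r), (0, 0)))
    Y = np.zeros_like(X)
    for g in range(c):
        ch = slice(g * w, (g + 1) * w)
        for t in range(T):
            win = Xp[t:t + kw, ch]                 # (kw, w) local window
            Y[t, ch] = np.einsum('jit,ti->j', K[g], win)

    # 3) Pointwise (1x1) convolution mixing all C channels
    Z = Y @ W.T + b

    # 4) Residual connection followed by ReLU
    return np.maximum(X + Z, 0.0)
```

With r = 1, the output at time t depends on at most one future input frame, which is the property the latency analysis below relies on.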

2. Model Architecture and Layerwise Specifications

The TDS-ConvNet acoustic encoder consists of a sequence of convolutional and TDS blocks, processing 80-dimensional log-Mel filterbank features sampled at 10 ms intervals. The architecture is as follows:

Layer(s)      Specification
0 (input)     80-dim log-Mel filterbank, 10 ms stride
1             1-D conv (80→256), stride 2, pad 2
2–3           2× TDS blocks, no subsampling
4             1-D conv (800→512), stride 2, pad 2
5–7           3× TDS blocks, no subsampling
8             1-D conv (512→512), stride 2, pad 2
9–12          4× TDS blocks, no subsampling
13            1-D conv (512→512), stride 1, pad 0
14–18         5× TDS blocks, no subsampling
19 (output)   Linear (512→N), log-softmax, CTC

The model contains approximately 104 million parameters, with an overall 8× subsampling factor due to three stride-2 convolutions. Each TDS block sees one future frame (r = 1), and this right context accumulates to about 250 ms across all TDS layers.

3. Temporal Dynamics, Receptive Field, and Latency

The effective receptive field is governed by the progression of convolutional and TDS layers: R = 1 + \sum_{l} (k_l - 1) \prod_{l' < l} s_{l'}, with k_l and s_l the kernel size and stride, respectively, of layer l. For the three stride-2 convolutions (s_l = 2) and all TDS blocks (s_l = 1), the full receptive field is approximately 10 seconds (i.e., 1000 frames at a 10 ms interval).

Future context (right padding) only accumulates across the TDS layers: r_{\text{total}} = \sum_{l} r_l \prod_{l' < l} s_{l'} input frames, where r_l is the right padding of layer l.

Constraining r to small values (here, r = 1 per TDS block) dramatically reduces total model-induced latency. Operational end-to-end latency is the sum of the acoustic model's future context (~250 ms), decoder delay, and audio chunking overhead.
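The two accumulation formulas above can be computed with a short helper. The layer stack below uses placeholder kernel sizes and paddings (kernel size 11 is a made-up value, not the paper's configuration), so the printed numbers only illustrate how stride compounds context rather than reproduce the paper's figures:

```python
def context_stats(layers):
    """Receptive field and future context of a stack of 1-D conv layers.

    layers: list of (kernel k, stride s, right padding r), bottom to top.
    Implements R = 1 + sum_l (k_l - 1) * prod_{l'<l} s_{l'} and the
    analogous weighted sum of r_l for the accumulated future context.
    """
    R, future, jump = 1, 0, 1          # all counts in input frames
    for k, s, r in layers:
        R += (k - 1) * jump
        future += r * jump
        jump *= s                      # input frames per step at this depth
    return R, future, jump

# Placeholder stack: three stride-2 convolutions interleaved with TDS
# blocks (stride 1, one future frame each).
layers = ([(11, 2, 0)] + [(11, 1, 1)] * 2
        + [(11, 2, 0)] + [(11, 1, 1)] * 3
        + [(11, 2, 0)] + [(11, 1, 1)] * 4
        + [(11, 1, 0)] + [(11, 1, 1)] * 5)
R, future, jump = context_stats(layers)
print(R, future, jump)   # R ~ 1000 input frames, i.e. ~10 s at 10 ms/frame
```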

4. CTC Objective and Sequence Decoding

The encoder output is projected to N vocabulary logits per frame, followed by a log-softmax. The model is trained with the Connectionist Temporal Classification (CTC) loss

\mathcal{L}_{\mathrm{CTC}} = -\log \sum_{\pi \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} p(\pi_t \mid x),

where \mathcal{B}^{-1}(y) is the set of all length-T alignments (with insertion of blank tokens) that collapse to the label sequence y. Standard forward–backward dynamic programming is used, without modification of the canonical CTC formulation.
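For intuition, the sum over alignments can be evaluated by brute force on toy-sized inputs. This is illustrative only; real training uses the equivalent forward–backward recursion:

```python
import itertools
import math
import numpy as np

def ctc_loss_bruteforce(log_probs, target, blank=0):
    """CTC negative log-likelihood by enumerating every alignment.

    log_probs : (T, V) per-frame log-softmax scores
    target    : label sequence without blanks
    Feasible only for tiny T and V.
    """
    T, V = log_probs.shape
    total = -math.inf
    for path in itertools.product(range(V), repeat=T):
        # collapse: merge repeated labels, then remove blanks
        collapsed = [k for k, _ in itertools.groupby(path) if k != blank]
        if collapsed == list(target):
            score = sum(log_probs[t, k] for t, k in enumerate(path))
            total = np.logaddexp(total, score)
    return -total

# Uniform 3-frame, 3-symbol example: 6 of the 27 paths collapse to [1]
lp = np.full((3, 3), math.log(1.0 / 3.0))
print(ctc_loss_bruteforce(lp, [1]))   # -log(6/27)
```

Note that a repeated label such as [1, 1] requires a separating blank in the alignment, which is why the blank symbol is part of the formulation.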

Online decoding employs a prefix beam search algorithm over the CTC output, incorporating an n-gram language model via simple log-linear score interpolation.
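The log-linear interpolation amounts to a one-line scoring function for ranking beam hypotheses. The weight names and values here are hypothetical, not the paper's tuned hyperparameters:

```python
def hypothesis_score(log_p_am, log_p_lm, n_tokens,
                     lm_weight=0.8, token_bonus=0.5):
    """Log-linear combination of acoustic score, n-gram LM score, and a
    length bonus; higher (less negative) scores rank higher in the beam."""
    return log_p_am + lm_weight * log_p_lm + token_bonus * n_tokens
```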

5. Performance Metrics and Trade-Offs

TDS-ConvNet systems are benchmarked against strong low-latency baselines (LC-BLSTM + LF-MMI and LC-BLSTM + RNN-T), with the following results:

Metric                                LC-BLSTM + LF-MMI   LC-BLSTM + RNN-T   TDS conv + CTC
Parameters                            80 M                60 M               104 M
Inference precision                   INT8                INT8               FP16
WER (vid-clean)                       14.10%              13.93%             13.19%
WER (vid-noisy)                       22.15%              22.58%             21.16%
Throughput (s audio / s)              55                  64                 147
RTF @ 40 streams                      0.70                0.60               0.26
User-perceived latency (40 streams)   1.18 s              1.09 s

Ablation on future context vs WER demonstrates that reducing future context from 5 s to 0.25 s incurs only a ~4% relative WER degradation (vid-clean: 12.65% → 13.19%; vid-noisy: 20.44% → 21.16%) while reducing model latency by an order of magnitude. TDS-ConvNet achieves approximately three times the throughput of an optimized hybrid baseline, with significantly reduced real-time factor (RTF).
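The relative-degradation figures quoted above follow directly from the reported WER pairs:

```python
def rel_change(old, new):
    """Relative change between two WER values, as a fraction."""
    return (new - old) / old

print(round(100 * rel_change(12.65, 13.19), 1))   # vid-clean:  4.3 (% relative)
print(round(100 * rel_change(20.44, 21.16), 1))   # vid-noisy:  3.5 (% relative)
```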

6. Decoder Optimizations and Practical Considerations

The optimized wav2letter++ CTC beam-search decoder is further enhanced by two additional pruning strategies: 1. Acoustic pruning: during beam expansion, retain only the top-k tokens by local acoustic score. 2. Blank pruning: if the blank probability at the current frame exceeds a threshold, only the blank arc is extended.
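A minimal sketch of the two pruning rules follows; the top-k cutoff and blank threshold values are hypothetical, since the paper's actual settings are not given here:

```python
import math
import numpy as np

def candidate_tokens(log_probs, k=10, blank=0,
                     blank_threshold=math.log(0.999)):
    """Per-frame candidate selection for CTC beam search.

    Acoustic pruning: keep only the top-k tokens by local acoustic score.
    Blank pruning: if the blank log-probability exceeds the threshold,
    extend only the blank arc.
    """
    if log_probs[blank] > blank_threshold:
        return [blank]                      # blank-pruning short-circuit
    topk = np.argsort(log_probs)[-k:]       # acoustic pruning
    return sorted(int(t) for t in topk)
```

Because most frames of a CTC output are dominated by the blank symbol, the blank-pruning branch fires frequently and skips most beam expansions.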

Combined with the encoder's 8× subsampling, decoding constitutes approximately 5% of total inference time.

Training and inference notes:

  • Training is performed with SpecAugment and local mean/variance normalization (window: 300 frames, ≈3 s) to enable online normalization.
  • Inference leverages FBGEMM for efficient mixed-precision (FP16) grouped convolutions.
  • Chunk size (e.g., 750 ms) and the number of concurrent streams (e.g., 40–60) can be tuned to balance latency and throughput.
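The local mean/variance normalization in the first bullet can be sketched as a sliding window. A trailing (causal) window is assumed here, which is one way to make the normalization online; the exact windowing is an implementation choice:

```python
import numpy as np

def local_normalize(feats, window=300, eps=1e-5):
    """Causal local mean/variance normalization of filterbank features.

    Each frame is normalized with statistics of the most recent `window`
    frames (300 frames ~ 3 s at a 10 ms stride), so no future audio is
    required.
    """
    T, F = feats.shape
    out = np.empty((T, F))
    for t in range(T):
        ctx = feats[max(0, t - window + 1):t + 1]   # trailing context
        out[t] = (feats[t] - ctx.mean(axis=0)) / (ctx.std(axis=0) + eps)
    return out
```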

The overall architecture—fully convolutional and devoid of recurrent connections—enables streaming-friendly operation, often with higher throughput and lower latency than traditional RNN-based ASR systems, while holding competitive or superior recognition accuracy (Pratap et al., 2020).
