TDS-ConvNets for Low-Latency Speech Recognition
- TDS-ConvNets are fully convolutional acoustic models using time–depth separable 1D convolutional residual blocks designed for online, low-latency speech recognition.
- They combine grouped temporal convolutions, pointwise convolutions, and Connectionist Temporal Classification (CTC) loss to deliver high throughput with competitive word error rates.
- Benchmark results show roughly three times the throughput and a substantially lower real-time factor than LC-BLSTM baselines at equal or better word error rates; restricting future context to ~250 ms costs only about 4% relative WER versus an unconstrained model.
Time-Depth Separable Convolutional Networks (TDS-ConvNets) are a class of fully convolutional architectures designed for high-throughput, low-latency end-to-end speech recognition. Developed for efficient online acoustic modeling, TDS-ConvNets employ “time–depth separable” 1D convolutional residual blocks together with the Connectionist Temporal Classification (CTC) loss and an optimized beam-search decoder. The approach yields significant improvements in inference speed and latency while maintaining competitive word error rates (WER) in both clean and noisy speech recognition scenarios (Pratap et al., 2020).
1. Mathematical Formulation of the TDS Block
A TDS block is defined as a residual 1D convolution that is both time- and depth-separable. The structural parameters are denoted TDS($g$, $w$, $k$, $r$), where $g$ is the number of channel-groups, $w$ is the width of each group (so the total channel count is $c = g \cdot w$), $k$ is the temporal kernel size, and $r$ is the right padding (the number of future frames used for context).
The block comprises the following stages:
- Layer normalization is applied along the channel and group-width axes, with statistics computed independently at each time-step:

  $$\hat{x}_t = \gamma \odot \frac{x_t - \mu_t}{\sqrt{\sigma_t^2 + \epsilon}} + \beta, \qquad \mu_t = \frac{1}{c} \sum_{j=1}^{c} x_{t,j}, \quad \sigma_t^2 = \frac{1}{c} \sum_{j=1}^{c} (x_{t,j} - \mu_t)^2.$$

- Grouped 1D convolution over time: the normalized input is split into $g$ contiguous groups $\hat{x}^{(1)}, \dots, \hat{x}^{(g)}$ of width $w$ each. Group $i$ undergoes temporal convolution with a group-specific kernel $K^{(i)} \in \mathbb{R}^{w \times w \times k}$ and asymmetric padding ($k - 1 - r$ past frames, $r$ future frames):

  $$z_t^{(i)} = \sum_{\tau = -(k-1-r)}^{r} K_\tau^{(i)} \, \hat{x}^{(i)}_{t+\tau}, \qquad i = 1, \dots, g.$$

  The outputs for all groups are concatenated across the channel dimension: $z_t = [z_t^{(1)}; \dots; z_t^{(g)}] \in \mathbb{R}^{c}$.
- Pointwise (1×1) convolution: the concatenated output $z_t$ is transformed by a learnable matrix $W \in \mathbb{R}^{c \times c}$ and bias $b \in \mathbb{R}^{c}$:

  $$y_t = W z_t + b.$$

- Residual connection and ReLU:

  $$o_t = \mathrm{ReLU}(x_t + y_t).$$
Factorization into grouped and pointwise convolutions significantly reduces the parameter count relative to a dense temporal convolution: the per-block weight count drops from $c^2 k$ to $g w^2 k + c^2 = c^2 k / g + c^2$. Asymmetric padding with small $r$ restricts the required future context, trading a little accuracy for lower latency.
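To make the factorization concrete, the following is a minimal PyTorch sketch of a single TDS block under the notation above. It is an illustrative reimplementation, not the wav2letter++ code; class and argument names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TDSBlock(nn.Module):
    """Time-depth separable block TDS(g, w, k, r): grouped temporal conv
    followed by a pointwise (1x1) conv, with residual connection and ReLU."""

    def __init__(self, g: int, w: int, k: int, r: int):
        super().__init__()
        c = g * w                       # total channels
        self.k, self.r = k, r
        self.norm = nn.LayerNorm(c)     # per-time-step stats over channels
        # Grouped conv over time: each of the g groups has its own w x w x k kernel.
        self.grouped = nn.Conv1d(c, c, kernel_size=k, groups=g, padding=0)
        self.pointwise = nn.Conv1d(c, c, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, c, T)
        xn = self.norm(x.transpose(1, 2)).transpose(1, 2)
        # Asymmetric padding: k-1-r past frames on the left, r future frames on the right.
        xn = F.pad(xn, (self.k - 1 - self.r, self.r))
        z = self.grouped(xn)            # group outputs concatenated along channels
        y = self.pointwise(z)
        return F.relu(x + y)            # residual + ReLU
```

For instance, `TDSBlock(g=4, w=128, k=9, r=1)` maps a `(batch, 512, T)` input to the same shape while looking only one frame into the future (the hyperparameter values here are arbitrary).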
2. Model Architecture and Layerwise Specifications
The TDS-ConvNet acoustic encoder consists of a sequence of convolutional and TDS blocks, processing 80-dimensional log-Mel filterbank features sampled at 10 ms intervals. The architecture is as follows:
| Layer(s) | Specification | Output shape |
|---|---|---|
| 0 (input) | 80-dim log-Mel filterbank, 10 ms stride | $T \times 80$ |
| 1 | 1×1 conv (80→256), stride 2 | $T/2 \times 256$ |
| 2–3 | 2× TDS blocks, no subsampling | $T/2 \times 256$ |
| 4 | 1×1 conv (800→512), stride 2 | $T/4 \times 512$ |
| 5–7 | 3× TDS blocks, no subsampling | $T/4 \times 512$ |
| 8 | 1×1 conv (512→512), stride 2 | $T/8 \times 512$ |
| 9–12 | 4× TDS blocks, no subsampling | $T/8 \times 512$ |
| 13 | 1×1 conv (512→512), stride 1 | $T/8 \times 512$ |
| 14–18 | 5× TDS blocks, no subsampling | $T/8 \times 512$ |
| 19 (output) | Linear (512→$N$), log-softmax, CTC | $T/8 \times N$ |
The model contains approximately 104 million parameters, with an overall 8× subsampling factor from the three stride-2 convolutions. Each TDS block uses one frame of future context ($r = 1$), which accumulates to about 250 ms across all TDS layers.
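A quick sanity check of the table's frame-rate bookkeeping, in Python. The placement of the stride-2 convolutions at layers 1, 4, and 8 is an assumption consistent with the stated 8× subsampling; it is not recoverable from the source.

```python
# Stride per conv layer; TDS blocks are stride 1 ("no subsampling").
strides = {1: 2, 4: 2, 8: 2, 13: 1}

frame_ms = 10.0  # 10 ms hop of the input log-Mel features
for layer in range(1, 19):
    frame_ms *= strides.get(layer, 1)

print(frame_ms)  # 80.0 -> one output frame per 80 ms, i.e. 8x subsampling
```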
3. Temporal Dynamics, Receptive Field, and Latency
The effective receptive field is governed by the progression of convolutional and TDS layers:

$$R = 1 + \sum_{l=1}^{L} (k_l - 1) \prod_{j=1}^{l-1} s_j,$$

with $k_l$ and $s_l$ the kernel size and stride, respectively, of layer $l$. With the three stride-2 convolutions ($s_l = 2$) and the stride-1 TDS blocks, the full receptive field is approximately 10 seconds (i.e., 1000 frames at a 10 ms interval).
Future context (right padding) accumulates only across the TDS layers:

$$C_{\text{future}} = 10\ \text{ms} \cdot \sum_{l \in \mathrm{TDS}} r_l \prod_{j=1}^{l-1} s_j.$$
Constraining $r_l$ to small values (here, one frame per block) dramatically reduces total model-induced latency. Operational end-to-end latency is the sum of the acoustic model's future context (~250 ms), decoder delay, and audio chunking overhead.
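The two formulas above can be evaluated mechanically. A small helper sketch follows; the kernel sizes and padding values in the examples are toy numbers, since the actual per-layer values are not preserved in the source.

```python
def receptive_field(layers):
    """Receptive field (in input frames) of a stack of 1D conv layers.

    layers: list of (kernel_size, stride) pairs, in order.
    Implements R = 1 + sum_l (k_l - 1) * prod_{j<l} s_j.
    """
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

def future_context_ms(layers, frame_ms=10.0):
    """Accumulated future context of a stack of layers.

    layers: list of (right_padding, stride) pairs, in order. Each layer
    contributes r_l frames at its *input* frame period.
    """
    ctx, period = 0.0, frame_ms
    for r, s in layers:
        ctx += r * period
        period *= s
    return ctx

print(receptive_field([(9, 2), (9, 1), (9, 2)]))            # 41 frames
print(future_context_ms([(0, 2), (1, 1), (1, 1), (0, 2)]))  # 40.0 ms
```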
4. CTC Objective and Sequence Decoding
The encoder output is projected to $N$ vocabulary logits per frame, followed by a log-softmax. The model is trained with Connectionist Temporal Classification (CTC), with loss

$$\mathcal{L}_{\mathrm{CTC}} = -\log \sum_{\pi \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} p_t(\pi_t \mid x),$$

where $\mathcal{B}^{-1}(y)$ is the set of all length-$T$ alignments (with insertion of blank tokens) that collapse to the target sequence $y$. Standard forward–backward dynamic programming is used, without modification of the canonical CTC formulation.
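Because the canonical CTC formulation is used unmodified, training reduces to the stock CTC loss. A minimal PyTorch sketch with illustrative shapes (blank id 0, vocabulary size $N = 5000$, and target length 30 are assumptions, not values from the source):

```python
import torch
import torch.nn as nn

# Illustrative shapes: T output frames, batch B, vocabulary size N.
T, B, N = 200, 4, 5000
log_probs = torch.randn(T, B, N).log_softmax(dim=-1)  # encoder logits + log-softmax

targets = torch.randint(1, N, (B, 30))                # label ids; 0 reserved for blank
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 30, dtype=torch.long)

# Forward-backward dynamic programming happens inside nn.CTCLoss.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```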
Online decoding employs a prefix beam search over the CTC output, incorporating an $n$-gram language model via simple log-linear score interpolation.
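Concretely, log-linear interpolation ranks each beam hypothesis by a weighted sum of acoustic and language-model scores. A schematic scoring function follows; the names and default weights are illustrative, and the word-insertion term is a common companion in CTC decoders rather than something stated in the source.

```python
def hypothesis_score(log_p_ctc: float, log_p_lm: float, num_words: int,
                     alpha: float = 0.5, beta: float = 1.0) -> float:
    """Rank a prefix-beam-search hypothesis by log-linear interpolation:
    acoustic (CTC) score + alpha * n-gram LM score + beta * word count.
    alpha and beta are tuned on a development set."""
    return log_p_ctc + alpha * log_p_lm + beta * num_words
```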
5. Performance Metrics and Trade-Offs
TDS-ConvNet systems are benchmarked against strong low-latency baselines (LC-BLSTM + LF-MMI and LC-BLSTM + RNN-T), with the following results:
| Metric | LC-BLSTM + LF-MMI | LC-BLSTM + RNN-T | TDS conv + CTC |
|---|---|---|---|
| Parameters | 80 M | 60 M | 104 M |
| Inference precision | INT8 | INT8 | FP16 |
| WER (vid-clean) | 14.10% | 13.93% | 13.19% |
| WER (vid-noisy) | 22.15% | 22.58% | 21.16% |
| Throughput (sec audio/sec) | 55 | 64 | 147 |
| RTF@40 streams | 0.70 | 0.60 | 0.26 |
| User-perceived latency (40 streams) | 1.18 s | – | 1.09 s |
Ablation on future context vs WER demonstrates that reducing future context from 5 s to 0.25 s incurs only a ~4% relative WER degradation (vid-clean: 12.65% → 13.19%; vid-noisy: 20.44% → 21.16%) while reducing model latency by an order of magnitude. TDS-ConvNet achieves approximately three times the throughput of an optimized hybrid baseline, with significantly reduced real-time factor (RTF).
6. Decoder Optimizations and Practical Considerations
The optimized wav2letter++ CTC beam-search decoder is further enhanced by two additional pruning strategies (sketched below):
1. Acoustic pruning: during beam expansion, retain only the top-$K$ tokens by local acoustic score.
2. Blank pruning: if the blank posterior $p_t(\varnothing)$ at a frame exceeds a fixed threshold, only the blank arc is extended.
Combined with the encoder's 8× subsampling, decoding constitutes approximately 5% of total inference time.
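A minimal per-frame sketch of how the two pruning rules slot into prefix beam search. The function name, threshold value, and list-based representation are illustrative; the production wav2letter++ decoder is optimized C++.

```python
import math

BLANK = 0  # CTC blank token id (assumed)

def frame_candidates(log_probs_t, top_k, blank_log_threshold=math.log(0.999)):
    """Per-frame candidate selection for CTC prefix beam search.

    Blank-pruning: if the blank log-probability exceeds the threshold,
    extend only the blank arc. Acoustic-pruning: otherwise keep only the
    top-k tokens by local acoustic score.
    """
    if log_probs_t[BLANK] > blank_log_threshold:
        return [BLANK]
    ranked = sorted(range(len(log_probs_t)),
                    key=lambda tok: log_probs_t[tok], reverse=True)
    return ranked[:top_k]
```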
Training and inference notes:
- Training is performed with SpecAugment and local mean/variance normalization (window: 300 frames, ≈3 s) to enable online normalization (see the sketch after this list).
- Inference leverages FBGEMM for efficient mixed-precision (FP16) grouped convolutions.
- Chunk size (e.g., 750 ms) and the number of concurrent streams (e.g., 40–60) can be tuned to balance latency and throughput.
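The local normalization in the first note can be computed without waiting for future audio. A minimal sketch, assuming a trailing window (the source does not specify the window alignment):

```python
import numpy as np

def local_normalize(feats: np.ndarray, window: int = 300) -> np.ndarray:
    """Mean/variance-normalize each frame over the trailing `window` frames
    (300 frames ~ 3 s at a 10 ms hop), so statistics are available online.

    feats: (T, 80) array of log-Mel features.
    """
    out = np.empty_like(feats)
    for t in range(len(feats)):
        ctx = feats[max(0, t - window + 1): t + 1]   # past-only context
        mu = ctx.mean(axis=0)
        sigma = ctx.std(axis=0) + 1e-5               # avoid division by zero
        out[t] = (feats[t] - mu) / sigma
    return out
```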
The overall architecture—fully convolutional and devoid of recurrent connections—enables streaming-friendly operation, often with higher throughput and lower latency than traditional RNN-based ASR systems, while holding competitive or superior recognition accuracy (Pratap et al., 2020).