Dynamic Encoder Transducer (DET)
- Dynamic Encoder Transducer (DET) is an ASR architecture featuring multiple encoder branches with varying depths to dynamically trade off accuracy and latency.
- It employs structured layer dropout and collaborative learning to adjust encoder capacity at runtime without requiring retraining or fine-tuning.
- Empirical results show that DET reduces WER and computational latency, enabling efficient on-device ASR across diverse computing environments.
The Dynamic Encoder Transducer (DET) is an architecture for end-to-end automatic speech recognition (ASR) that provides a flexible mechanism to trade off recognition accuracy against system latency without requiring retraining or fine-tuning. DET introduces a set of encoder branches with different depths in a single model, enabling dynamic selection of encoder capacity at runtime according to computing budget or real-time constraints. DET employs two principal training strategies—structured layer dropout and collaborative learning—allowing deployment on heterogeneous devices and dynamic assignment of encoder capacity within or across utterances (Shi et al., 2021).
1. Model Architecture and RNN-T Integration
DET is built upon the Recurrent Neural Network Transducer (RNN-T) framework, a standard in streaming ASR. The RNN-T consists of three main components:
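- Encoder ($f^{\rm enc}$): Maps the acoustic frame sequence $X = (x_1, \ldots, x_T)$ to hidden representations: $h_t^e = f^{\rm enc}(x_{1:t})$
- Predictor ($f^{\rm pred}$): Consumes the label history (with a prepended blank symbol $\varnothing$) and outputs hidden states: $h_u^p = f^{\rm pred}(\varnothing, y_{1:u-1})$
- Joiner ($f^{\rm join}$): Combines encoder and predictor states to produce logits for each $(t, u)$: $z_{t,u} = f^{\rm join}(h_t^e, h_u^p)$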
The RNN-T loss marginalizes over all possible alignment paths $\pi$ corresponding to the output label sequence $Y$, using the collapse operator $\mathcal{B}$:
$\mathcal{L}_{\rm RNN\mbox{-}T} = -\sum_{(X,Y)} \log P(Y|X) = -\sum_{(X,Y)} \log\left[\sum_{\pi \in \mathcal{B}^{-1}(Y)} \prod_{t,u} P(\pi_{t,u} | X_{1:t}, Y_{1:u-1})\right]$
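In practice the inner sum over alignments is computed with a forward recursion over the $(t, u)$ lattice rather than by enumerating paths. The following is a minimal pure-Python sketch of that recursion, not the paper's implementation. It assumes a hypothetical nested list `log_probs[t][u][v]` holding the joiner's log-probabilities and assumes that index 0 is the blank symbol; the function names are illustrative.

```python
import math

def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def rnnt_log_likelihood(log_probs, labels, blank=0):
    """log P(Y|X) for one utterance.

    log_probs[t][u][v]: log-probability of symbol v at frame t (0-based)
    after u labels have been emitted. Shape: T x (U + 1) x V.
    """
    T, U = len(log_probs), len(labels)
    alpha = [[-math.inf] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # arrive by emitting blank at frame t - 1
                alpha[t][u] = log_add(alpha[t][u],
                                      alpha[t - 1][u] + log_probs[t - 1][u][blank])
            if u > 0:  # arrive by emitting label u at frame t
                alpha[t][u] = log_add(alpha[t][u],
                                      alpha[t][u - 1] + log_probs[t][u - 1][labels[u - 1]])
    # Every alignment ends with a blank emitted from the last lattice node.
    return alpha[T - 1][U] + log_probs[T - 1][U][blank]

# Toy check: 2 frames, 1 label, uniform distribution over {blank, label}.
uniform = [[[math.log(0.5)] * 2 for _ in range(2)] for _ in range(2)]
print(rnnt_log_likelihood(uniform, [1]))  # two alignments of probability 0.125 each
```

The negative of this value, summed over training pairs, gives the loss above.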
DET defines a set of $K$ encoder branches $e_1, \ldots, e_K$ with monotonically decreasing depths $d_1 > d_2 > \cdots > d_K$, so that $e_1$ is the deepest branch and $e_K$ the shallowest. Each branch shares its lower layers with the deepest branch and has an independent final layer, facilitating weight sharing and memory efficiency. Both predictor and joiner modules are shared among all branches, but each branch yields its own RNN-T loss.
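The weight-sharing layout can be illustrated with a toy container. This is a sketch under assumptions, not the paper's code: layers are plain callables, branch index 0 stands for the deepest branch, and `shared_depths[k]` (a hypothetical parameter) says how many shared layers branch `k` runs before its own final layer.

```python
class MultiBranchEncoder:
    """Toy encoder whose branches reuse one common stack of layers."""

    def __init__(self, shared_layers, final_layers, shared_depths):
        # shared_layers:    callables forming the stack used by the deepest branch
        # final_layers[k]:  independent last layer of branch k
        # shared_depths[k]: number of shared layers branch k runs first
        assert len(final_layers) == len(shared_depths)
        assert all(0 <= d <= len(shared_layers) for d in shared_depths)
        self.shared_layers = shared_layers
        self.final_layers = final_layers
        self.shared_depths = shared_depths

    def forward(self, h, branch):
        """Run a single branch."""
        for layer in self.shared_layers[: self.shared_depths[branch]]:
            h = layer(h)
        return self.final_layers[branch](h)

    def forward_all(self, h):
        """Run every branch, computing each shared layer only once."""
        outputs = {}
        order = sorted(range(len(self.shared_depths)),
                       key=lambda k: self.shared_depths[k])
        done = 0
        for k in order:  # shallowest branch first
            for layer in self.shared_layers[done : self.shared_depths[k]]:
                h = layer(h)
            done = self.shared_depths[k]
            outputs[k] = self.final_layers[k](h)
        return outputs

# Stand-in "layers" on scalars, just to show the control flow.
enc = MultiBranchEncoder(
    shared_layers=[lambda h: h + 1] * 4,
    final_layers=[lambda h: h * 10, lambda h: h * 100],
    shared_depths=[4, 2],
)
print(enc.forward_all(0))
```

`forward_all` reflects the training setting, where every branch produces an output for its own loss while the shared prefix is evaluated once.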
2. Structured Layer Dropout for Encoder Depth Flexibility
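DET implements on-demand capacity selection at inference through structured layer dropout. During training, candidate layers are stochastically skipped with probability $p$, emulating "stochastic depth" or "layerdrop." When a layer is skipped, it is replaced by the identity, preserving residual connections:
$h_\ell = \delta_\ell\, f_\ell(h_{\ell-1}) + (1 - \delta_\ell)\, h_{\ell-1}, \qquad \delta_\ell \sim \mathrm{Bernoulli}(1 - p)$
where $f_\ell$ denotes layer $\ell$ (including its residual connection) and $\delta_\ell = 1$ indicates retention; layers that are not dropout candidates always have $\delta_\ell = 1$.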
The objective is the expected RNN-T loss under dropout distributions:
$\min_\theta \mathbb{E}_{\{\delta_\ell\}}\left[\mathcal{L}_{\rm RNN\mbox{-}T}(\theta; \{\delta_\ell\})\right]$
At inference, a smaller encoder of reduced depth is realized by deterministically removing layers that were eligible for dropout during training. The number of removed layers is chosen to meet a latency or compute budget, or to achieve a target real-time factor (RTF). This enables deployment of a "single binary" across devices with divergent computing resources.
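The training-time and inference-time behaviour described in this section can be sketched as follows. This is a minimal illustration under assumptions: layers are callables that already include their residual connection, the set of candidate layers (`droppable`) is chosen by hand, and the rule in `layers_to_remove` (drop the topmost candidates first) is hypothetical rather than taken from the paper.

```python
import random

def run_encoder(h, layers, droppable, p_drop=0.0, removed=frozenset(), rng=random):
    """Apply layers in order, skipping some of them.

    droppable: indices of candidate layers that may be skipped in training
    p_drop:    training-time skip probability for each candidate layer
    removed:   indices removed deterministically at inference time
    """
    for idx, layer in enumerate(layers):
        if idx in removed:
            continue  # inference: the layer is never executed
        if idx in droppable and rng.random() < p_drop:
            continue  # training: identity instead of the layer
        h = layer(h)
    return h

def layers_to_remove(droppable, n_remove):
    """Hypothetical selection rule: remove the topmost candidate layers first."""
    return frozenset(sorted(droppable, reverse=True)[:n_remove])

layers = [lambda h: h + 1] * 6   # stand-in layers acting on a scalar
droppable = {1, 3, 5}            # every other layer is a candidate

full = run_encoder(0, layers, droppable)                                   # all 6 layers
small = run_encoder(0, layers, droppable, removed=layers_to_remove(droppable, 2))
print(full, small)
```

Because the removed layers were already skipped at random during training, the smaller configuration is one the model has seen, which is why no fine-tuning is needed.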
3. Collaborative Learning for Multi-Depth Encoders
Collaborative learning is employed for joint optimization of encoders with multiple depths. Each shallower branch $e_k$ (for $k = 2, \ldots, K$) is treated as a "student" to the deepest (teacher) encoder $e_1$, leveraging two additional objectives:
- Cross-entropy (CE) loss on context-dependent grapheme states (chenones).
- Kullback–Leibler divergence (KLD) from the teacher to each student branch.
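The total loss is:
$\mathcal{L} = \sum_{k=1}^{K} \left[\mathcal{L}_{\rm RNN\mbox{-}T}^{(k)} + \lambda_{\rm CE}\, \mathcal{L}_{\rm CE}^{(k)}\right] + \lambda_{\rm KLD} \sum_{k=2}^{K} \mathcal{L}_{\rm KLD}^{(k)}$
with
$\mathcal{L}_{\rm KLD}^{(k)} = \sum_{t} D_{\rm KL}\!\left(P^{(1)}(\cdot \mid x_{1:t}) \,\|\, P^{(k)}(\cdot \mid x_{1:t})\right)$
where the superscript $(k)$ marks quantities computed with branch $e_k$, $P^{(k)}$ is the output distribution of that branch, and $\lambda_{\rm CE}$ and $\lambda_{\rm KLD}$ denote generic interpolation weights.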
Gradients from RNN-T and CE losses are back-propagated through all shared layers. KLD gradients are not propagated through the teacher branch, preventing its potential degradation. This multi-objective optimization encourages shallow branches to emulate deeper (more accurate) encodings.
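A minimal sketch of how the three kinds of loss terms might be combined is given below. It assumes the per-branch RNN-T and CE losses are already available as numbers, that index 0 is the deepest (teacher) branch, and that `w_ce` and `w_kld` are hypothetical weights; the values actually used in the paper are not stated here. Pure Python has no gradients, so the stop-gradient on the teacher is only indicated in a comment.

```python
import math

def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) for two discrete distributions given as lists."""
    return sum(p * (math.log(p + eps) - math.log(q + eps))
               for p, q in zip(teacher, student) if p > 0.0)

def collaborative_loss(rnnt_losses, ce_losses, posteriors, w_ce=1.0, w_kld=1.0):
    """Combine per-branch losses; index 0 is the deepest (teacher) branch.

    posteriors[k] is a list of per-frame distributions produced by branch k.
    """
    # In an autodiff framework the teacher posteriors would be detached here,
    # so that the KLD term only updates the student branches.
    teacher = posteriors[0]
    kld = 0.0
    for student in posteriors[1:]:
        kld += sum(kl_divergence(p, q) for p, q in zip(teacher, student))
    return sum(rnnt_losses) + w_ce * sum(ce_losses) + w_kld * kld

# Two branches, one frame, two output classes.
posteriors = [[[0.9, 0.1]], [[0.6, 0.4]]]
print(collaborative_loss([1.0, 2.0], [0.5, 0.5], posteriors))
```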
4. Dynamic Encoder Assignment Policies
DET supports online switching between encoder branches according to an assignment policy. A simple fixed policy uses a time threshold $\tau$:
- Decode the initial $\tau$ seconds (or frames) with the lightweight encoder $e_K$, minimizing startup or cold-start latency.
- Switch to the full encoder $e_1$ for the remainder to maximize accuracy.
This process trades minor degradation on early utterance segments for substantial reductions in startup latency. The switching mechanism can be formalized as:
```python
def dynamic_encode(x, tau, e_K, e_1):
    """Encoder states h_t^e for frames x_1..x_T, switching branches after frame tau."""
    states = []
    for t in range(1, len(x) + 1):
        # Lightweight branch e_K up to the switch point, full branch e_1 afterwards.
        encoder = e_K if t <= tau else e_1
        # Each branch sees only the causal prefix x_{1:t}.
        states.append(encoder(x[:t]))
    # Each h_t^e then feeds joiner(h_t^e, predictor(y_{1:u-1})) during RNN-T decoding.
    return states
```
More adaptive policies could base switching on CPU load monitoring or dynamic confidence estimates of partial decoding hypotheses.
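As one illustration of such an adaptive policy, the sketch below picks the deepest branch whose expected RTF still meets a target once a load multiplier is applied. Everything here is hypothetical: the function name, the `load_factor` model of CPU contention, and the fallback rule are assumptions, and the RTF values are borrowed from the 20-layer and 14-layer baseline rows of the results table purely as placeholders.

```python
def pick_branch(rtf_by_depth, target_rtf, load_factor=1.0):
    """Return the depth of the deepest branch whose load-adjusted RTF meets the target.

    rtf_by_depth: dict mapping encoder depth (in layers) to its measured RTF
    load_factor:  hypothetical multiplier (>= 1.0) modelling slowdown under CPU load
    Falls back to the shallowest branch when no branch meets the target.
    """
    feasible = [depth for depth, rtf in rtf_by_depth.items()
                if rtf * load_factor <= target_rtf]
    return max(feasible) if feasible else min(rtf_by_depth)

rtfs = {20: 0.63, 14: 0.47}  # placeholder measurements
print(pick_branch(rtfs, target_rtf=0.7))                   # idle device
print(pick_branch(rtfs, target_rtf=0.7, load_factor=1.3))  # loaded device
```

A policy of this kind could be re-evaluated between utterances or at intervals within one, since switching only changes which layers are executed.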
5. Empirical Evaluation
DET has been benchmarked on LibriSpeech and proprietary in-house datasets, examining both static and dynamic encoder selection.
5.1. One-Encoder Decoding (Static Depth)
| Model | WER % (clean / other) | RTF | Model Size |
|---|---|---|---|
| Baseline (20 layers) | 3.62 / 9.86 | 0.63 | 77M params |
| Baseline retrained (14 layers) | 3.87 / 10.35 | 0.47 | 58M params |
| Layer-drop (14 layers) | 4.35 / 11.56 | - | - |
| Collab (20 layers) | 3.54 / 9.04 | - | - |
| Collab (14 layers) | 3.66 / 9.60 | - | - |
On LibriSpeech, the full-size DET encoder trained with collaborative learning reduces WER on the "other" subset by over 8% relative to the same-size baseline (9.86 to 9.04). The lightweight 14-layer encoder, trained collaboratively, yields a 25% model size reduction with WER similar to the full-size baseline (3.66 / 9.60 versus 3.62 / 9.86). Analogous trends are observed on in-house data, with shallow encoders providing improved real-time factor (RTF) and startup processing latency (SPL).
5.2. Dynamic-Encoder Decoding (Mixed Depths)
Dynamic assignment, switching after 0.8 s on LibriSpeech, yielded WERs (clean / other) of 3.84 / 9.49 with layer-drop and 3.62 / 9.27 with collaborative learning, at RTFs of 0.60 and 0.56 respectively. Compared to the 20-layer baseline (3.62 / 9.86, RTF 0.63), the collaborative model achieves a 6% relative WER reduction on the "other" subset and an RTF reduction of up to 11%. Startup latency improvements exceeding 10% were achieved with negligible WER penalty. In-house evaluations confirm the effectiveness in reducing perceived latency and computational overhead.
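The relative figures quoted in this section follow directly from the reported numbers. The short check below recomputes them; the helper name is illustrative, and all inputs are copied from the table and text above.

```python
def relative_reduction(baseline, value):
    """Relative reduction of `value` with respect to `baseline`, as a fraction."""
    return (baseline - value) / baseline

# Static decoding (Section 5.1)
print(f"WER (other), collab 20 layers vs baseline: {relative_reduction(9.86, 9.04):.1%}")  # 8.3%
print(f"Parameters, 14 vs 20 layers:               {relative_reduction(77, 58):.1%}")      # 24.7%

# Dynamic decoding (Section 5.2)
print(f"WER (other), dynamic collab vs baseline:   {relative_reduction(9.86, 9.27):.1%}")  # 6.0%
print(f"RTF, dynamic collab vs baseline:           {relative_reduction(0.63, 0.56):.1%}")  # 11.1%
```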
6. Discussion and Implications for On-Device ASR
DET enables a continuous spectrum of accuracy–latency configurations by tuning either encoder depth or intra-utterance switching threshold. Collaborative learning ensures minimal accuracy degradation even for heavily pruned shallow encoders, while layer-drop facilitates immediate deployment flexibility without fine-tuning. DET's shared-weight architecture supports deployment of a single binary tailored at runtime for strong, mid-tier, or dynamically adaptive environments. Switching logic is memory efficient, implemented via layer skipping rather than model reloading.
Wake-word detection and low-latency interaction scenarios particularly benefit from DET, as early audio can be processed with reduced depth, deferring the full encoder until accurate wakeup is confirmed. DET surpasses naïve pruning and multi-model deployment baselines, delivering a unified solution to the trade-off between ASR accuracy and real-time constraints (Shi et al., 2021).