Dynamic Encoder Transducer (DET)

Updated 15 January 2026
  • Dynamic Encoder Transducer (DET) is an ASR architecture featuring multiple encoder branches with varying depths to dynamically trade off accuracy and latency.
  • It employs structured layer dropout and collaborative learning to adjust encoder capacity at runtime without requiring retraining or fine-tuning.
  • Empirical results show that DET reduces WER and computational latency, enabling efficient on-device ASR across diverse computing environments.

The Dynamic Encoder Transducer (DET) is an architecture for end-to-end automatic speech recognition (ASR) that provides a flexible mechanism to trade off recognition accuracy against system latency without requiring retraining or fine-tuning. DET introduces a set of encoder branches with different depths in a single model, enabling dynamic selection of encoder capacity at runtime according to computing budget or real-time constraints. DET employs two principal training strategies—structured layer dropout and collaborative learning—allowing deployment on heterogeneous devices and dynamic assignment of encoder capacity within or across utterances (Shi et al., 2021).

1. Model Architecture and RNN-T Integration

DET is built upon the Recurrent Neural Network Transducer (RNN-T) framework, a standard in streaming ASR. The RNN-T consists of three main components:

  • Encoder ($f^e$): Maps the acoustic frame sequence $X = \{x_1, \dots, x_T\}$ to hidden representations:

$\{h_1^e, \dots, h_T^e\} = f^e(X)$

  • Predictor ($f^p$): Consumes the label history (with a prepended blank symbol $\phi$) and outputs hidden states:

$\{h_1^p, \dots, h_U^p\} = f^p(\{\phi, y_1, \dots, y_{U-1}\})$

  • Joiner ($f^j$): Combines encoder and predictor states to produce logits for each $(t,u)$:

$h_{t,u} = f^j(h_t^e, h_u^p), \quad P(y \mid x_{1:t}, y_{1:u-1}) = \text{softmax}(h_{t,u})$
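The joiner step can be illustrated with a small pure-Python sketch that builds the full $(t,u)$ lattice of output distributions. The additive `tanh` joiner, the toy weight matrix, and the function names are assumptions made for illustration; the source does not specify the joiner's internal form.

```python
import math

def softmax(logits):
    """Normalize a list of logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def joiner(h_enc, h_pred, weights):
    """Assumed additive joiner: project tanh(h_enc + h_pred) to vocabulary logits."""
    hidden = [math.tanh(a + b) for a, b in zip(h_enc, h_pred)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

def joint_lattice(enc_states, pred_states, weights):
    """Return lattice[t][u] = softmax(joiner(h_t^e, h_u^p)) for every (t, u)."""
    return [[softmax(joiner(h_e, h_p, weights)) for h_p in pred_states]
            for h_e in enc_states]

enc_states = [[0.1, -0.2], [0.4, 0.3], [0.0, 0.5]]  # T = 3 encoder frames
pred_states = [[0.2, 0.1], [-0.3, 0.6]]             # U = 2 predictor states
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]     # 3 output symbols (toy values)
lattice = joint_lattice(enc_states, pred_states, weights)
```

Each `lattice[t][u]` is one distribution over output symbols, which is the quantity the RNN-T loss consumes.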

The RNN-T loss marginalizes over all possible alignment paths $\pi$ corresponding to the output label sequence $Y$, using the collapse operator $\mathcal{B}$:

$\mathcal{L}_{\rm RNN\mbox{-}T} = -\sum_{(X,Y)} \log P(Y|X) = -\sum_{(X,Y)} \log\left[\sum_{\pi \in \mathcal{B}^{-1}(Y)} \prod_{t,u} P(\pi_{t,u} | X_{1:t}, Y_{1:u-1})\right]$
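The sum over alignments is normally computed with a forward dynamic program rather than by enumeration. The following is a minimal sketch of that standard recursion for a single utterance, assuming the lattice of log-probabilities is already available; the function names and the choice of index 0 for the blank symbol are illustrative assumptions, not details from the source.

```python
import math

NEG_INF = float("-inf")

def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    hi = max(a, b)
    return hi + math.log(math.exp(a - hi) + math.exp(b - hi))

def rnnt_neg_log_likelihood(log_probs, labels, blank=0):
    """log_probs[t][u][k] is the log-probability of symbol k at lattice node (t, u).

    The lattice has T rows and len(labels) + 1 columns. Returns -log P(Y | X).
    """
    T, U = len(log_probs), len(labels)
    alpha = [[NEG_INF] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # reach (t, u) by emitting blank at (t-1, u)
                alpha[t][u] = log_add(
                    alpha[t][u], alpha[t - 1][u] + log_probs[t - 1][u][blank])
            if u > 0:  # reach (t, u) by emitting the u-th label at (t, u-1)
                alpha[t][u] = log_add(
                    alpha[t][u], alpha[t][u - 1] + log_probs[t][u - 1][labels[u - 1]])
    # A final blank at the last node terminates every alignment.
    return -(alpha[T - 1][U] + log_probs[T - 1][U][blank])

# Toy check: uniform distribution over 3 symbols, T = 2 frames, one label.
uniform = [[[math.log(1 / 3)] * 3 for _ in range(2)] for _ in range(2)]
nll = rnnt_neg_log_likelihood(uniform, labels=[1])
```

In the toy case there are two valid alignments of three emissions each, so the likelihood is $2/27$ and `nll` equals $\log(27/2)$.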

DET defines a set of $K$ encoder branches $\{e_1, \dots, e_K\}$ with monotonically decreasing depths $d_1 > d_2 > \dots > d_K$. Each branch $e_i$ shares its first $d_i - 1$ layers with the deepest branch $e_1$ and has an independent final layer, which keeps most weights shared and the memory footprint small. The predictor and joiner modules are shared among all branches, but each branch yields its own RNN-T loss.
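The weight-sharing layout can be made concrete with a toy sketch. It assumes that a branch of depth $d_i$ reuses the first $d_i - 1$ layers of the deepest branch and then applies its own final layer; the class name `DetEncoder`, the trace-based "layers", and the depths 6 and 4 are hypothetical and chosen only to keep the example small.

```python
def make_layer(name):
    """Toy layer: appends its name to a trace so the execution path is visible."""
    def layer(trace):
        return trace + [name]
    return layer

class DetEncoder:
    def __init__(self, depths):
        self.depths = sorted(depths, reverse=True)  # d_1 > d_2 > ... > d_K
        deepest = self.depths[0]
        # Trunk layers 1 .. d_1 - 1 are shared by every branch that reaches them.
        self.trunk = [make_layer(f"shared_{n}") for n in range(1, deepest)]
        # Each branch owns exactly one independent final layer.
        self.final = {d: make_layer(f"final_d{d}") for d in self.depths}

    def forward(self, depth):
        """Run the branch of the given depth and return the layers it visited."""
        trace = []
        for layer in self.trunk[:depth - 1]:
            trace = layer(trace)
        return self.final[depth](trace)

    def layer_count(self):
        """Total number of distinct layers stored in the model."""
        return len(self.trunk) + len(self.final)

encoder = DetEncoder(depths=[6, 4])
shallow_path = encoder.forward(4)
deep_path = encoder.forward(6)
```

Under these assumptions the two branches together store 7 distinct layers, compared with 10 if the 6-layer and 4-layer encoders were kept as separate models.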

2. Structured Layer Dropout for Encoder Depth Flexibility

DET implements on-demand capacity selection at inference through structured layer dropout. During training, candidate layers $\ell \in \mathcal{D}$ are stochastically skipped with probability $p_\ell$, emulating "stochastic depth" or "LayerDrop." When a layer is skipped, it is replaced by the identity, preserving residual connections:

$h^{(\ell)} = \delta_\ell \, \text{Layer}_\ell(h^{(\ell-1)}) + (1 - \delta_\ell) \, h^{(\ell-1)},$

where $\delta_\ell \sim \text{Bernoulli}(1 - p_\ell)$ indicates retention.

The objective is the expected RNN-T loss under dropout distributions:

$\min_\theta \mathbb{E}_{\{\delta_\ell\}}\left[\mathcal{L}_{\rm RNN\mbox{-}T}(\theta; \{\delta_\ell\})\right]$
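A single stochastic forward pass under layer dropout might look like the sketch below. The scalar "hidden state", the toy layers, and the function name are assumptions for illustration; a real encoder would operate on tensors and draw fresh samples for every training batch.

```python
import random

def forward_with_layer_dropout(h, layers, drop_probs, rng):
    """Apply each layer with probability 1 - p_drop, otherwise pass h through unchanged."""
    kept = []
    for index, (layer, p_drop) in enumerate(zip(layers, drop_probs)):
        # delta ~ Bernoulli(1 - p_drop): rng.random() lies in [0, 1)
        delta = 1 if rng.random() >= p_drop else 0
        if delta:
            h = layer(h)
            kept.append(index)
        # delta == 0: the layer acts as the identity, so h is left as-is
    return h, kept

# Toy residual layers: each adds 1.0 to a scalar hidden state.
layers = [lambda h: h + 1.0] * 4
rng = random.Random(0)
output, kept = forward_with_layer_dropout(0.0, layers, [0.0, 1.0, 0.0, 1.0], rng)
```

With drop probabilities of exactly 0 or 1 the pass is deterministic, which is also how a reduced-depth encoder can be expressed at inference: the removed layers simply get a drop probability of 1.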

At inference, a smaller encoder (reduced depth $d' < d_\mathrm{full}$) is realized by deterministically removing layers that were eligible for dropout during training, with the number of removed layers chosen to meet a latency or compute budget $B$ or to achieve a target real-time factor (RTF). This enables deployment of a "single binary" across devices with differing computing resources.
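One way to turn a budget $B$ into a concrete layer set is sketched below. The uniform per-layer cost, the rule of removing droppable layers from the top of the stack first, and the name `select_layers` are all hypothetical; the source does not say how the removed layers are chosen.

```python
def select_layers(num_layers, droppable, cost_per_layer, budget):
    """Return the indices of layers to run so that total cost fits the budget.

    Only layers listed in `droppable` (those trained with layer dropout) may be
    removed; they are removed from the highest index downwards (an assumption).
    """
    active = list(range(num_layers))
    for layer in sorted(droppable, reverse=True):
        if len(active) * cost_per_layer <= budget:
            break
        active.remove(layer)
    if len(active) * cost_per_layer > budget:
        raise ValueError("budget cannot be met by removing droppable layers only")
    return active

droppable = [8, 10, 12, 14, 16, 18]  # hypothetical set of dropout-eligible layers
full_device = select_layers(20, droppable, cost_per_layer=1.0, budget=20.0)
mid_device = select_layers(20, droppable, cost_per_layer=1.0, budget=17.0)
small_device = select_layers(20, droppable, cost_per_layer=1.0, budget=14.0)
```

Here the same 20-layer model yields 20, 17, and 14 active layers for the three budgets, without any change to the stored weights.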

3. Collaborative Learning for Multi-Depth Encoders

Collaborative learning is employed for joint optimization of encoders with multiple depths. Each branch $e_i$ (for $i > 1$) is treated as a "student" of the deepest (teacher) encoder $e_1$, leveraging two additional objectives:

  • Cross-entropy (CE) loss on context-dependent grapheme states (chenones).
  • Kullback–Leibler divergence (KLD) from the teacher to each student branch.

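The total loss is:

$L = \alpha L^{CE} + \beta L^{KLD} + \sum_{i=1}^K L^{Tr_i}$

with

$L^{CE} = -\frac{1}{T} \sum_{i=1}^K \sum_{t=1}^T \log P^{e_i}(c_t \mid x_t), \quad L^{KLD} = \frac{1}{T} \sum_{i>1} \sum_{t=1}^T P^{e_1}(c_t \mid x_t) \log \frac{P^{e_1}(c_t \mid x_t)}{P^{e_i}(c_t \mid x_t)}$

where $L^{Tr_i}$ is the RNN-T (transducer) loss of branch $e_i$, $P^{e_i}(c_t \mid x_t)$ is the chenone posterior of branch $e_i$ at frame $t$, and $\alpha$ and $\beta$ weight the auxiliary terms. The KLD term is written in its non-negative form, so that minimizing it pulls each student distribution toward the teacher's.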

Gradients from RNN-T and CE losses are back-propagated through all shared layers. KLD gradients are not propagated through the teacher branch, preventing its potential degradation. This multi-objective optimization encourages shallow branches to emulate deeper (more accurate) encodings.
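The combined objective can be evaluated numerically as in the following sketch. It assumes the KLD is taken in its standard non-negative form and summed over all chenone classes at each frame (the formula leaves that sum implicit), and it does not model gradient handling such as blocking KLD gradients at the teacher. All function names and toy values are illustrative.

```python
import math

def cross_entropy(posteriors, targets):
    """CE over all branches: posteriors[i][t][c] is branch i's P(c | x_t)."""
    frames = len(targets)
    total = sum(math.log(branch[t][targets[t]])
                for branch in posteriors for t in range(frames))
    return -total / frames

def kl_teacher_to_students(teacher, students):
    """Sum of KL(teacher || student) over student branches, averaged over frames."""
    frames = len(teacher)
    total = 0.0
    for student in students:
        for p_frame, q_frame in zip(teacher, student):
            total += sum(p * math.log(p / q)
                         for p, q in zip(p_frame, q_frame) if p > 0)
    return total / frames

def collaborative_loss(posteriors, targets, transducer_losses, alpha, beta):
    """alpha * CE + beta * KLD + sum of per-branch transducer losses."""
    teacher, students = posteriors[0], posteriors[1:]
    return (alpha * cross_entropy(posteriors, targets)
            + beta * kl_teacher_to_students(teacher, students)
            + sum(transducer_losses))

teacher = [[0.5, 0.5]]    # one frame, two classes (toy values)
student = [[0.25, 0.75]]
loss = collaborative_loss([teacher, student], targets=[0],
                          transducer_losses=[1.0, 2.0], alpha=1.0, beta=1.0)
```

When a student matches the teacher exactly, the KLD term vanishes and only the CE and transducer terms remain.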

4. Dynamic Encoder Assignment Policies

DET supports online switching between encoder branches according to an assignment policy. A simple fixed policy uses a time threshold $K$ (here a duration, distinct from the number of branches $K$):

  • Decode the initial $K$ seconds (or $T_K$ frames) with the lightweight encoder $e_K$ (the shallowest branch), minimizing startup (cold-start) latency.
  • Switch to the full encoder $e_1$ for the remainder to maximize accuracy.

This process trades minor degradation on early utterance segments for substantial reductions in startup latency. The switching mechanism can be formalized as:

Given audio X = {x_1, ..., x_T} and switch point T_K (in frames):
  for t in 1..T:
    if t <= T_K:
      h_t^e = e_K.process(x_{1:t})    # lightweight encoder for the first T_K frames
    else:
      h_t^e = e_1.process(x_{1:t})    # full encoder for the remainder
    y_t = joiner(h_t^e, predictor(y_{1:u-1}))

More adaptive policies could base switching on CPU load monitoring or dynamic confidence estimates of partial decoding hypotheses.
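A fixed-threshold policy with an optional load-based override could be sketched as follows. The function names, the 10 ms frame shift, the `load_limit` value, and the way CPU load is supplied are all hypothetical; the source only mentions load monitoring as a possible direction.

```python
def seconds_to_frames(seconds, frame_shift_ms):
    """Convert a switch time in seconds to a frame count."""
    return int(round(seconds * 1000 / frame_shift_ms))

def choose_encoder(frame_index, switch_frame, cpu_load=None, load_limit=0.9):
    """Return "light" or "full" for a 0-based frame index.

    The lightweight branch is used before the switch point, and also later on
    whenever a supplied CPU load reading exceeds load_limit (an assumed rule).
    """
    if frame_index < switch_frame:
        return "light"
    if cpu_load is not None and cpu_load > load_limit:
        return "light"
    return "full"

switch_frame = seconds_to_frames(0.8, frame_shift_ms=10)  # assumed 10 ms frames
plan = [choose_encoder(t, switch_frame) for t in range(200)]
overloaded = choose_encoder(150, switch_frame, cpu_load=0.97)
```

Because both branches share their lower layers, such a policy only changes which layers are executed per frame; nothing needs to be reloaded when the decision flips.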

5. Empirical Evaluation

DET has been benchmarked on LibriSpeech and proprietary in-house datasets, examining both static and dynamic encoder selection.

5.1. One-Encoder Decoding (Static Depth)

| Model | WER % (clean / other) | RTF | Model size |
|---|---|---|---|
| Baseline (20 layers) | 3.62 / 9.86 | 0.63 | 77M params |
| Baseline retrained (14 layers) | 3.87 / 10.35 | 0.47 | 58M params |
| Layer-drop (14 layers) | 4.35 / 11.56 | - | - |
| Collab (20 layers) | 3.54 / 9.04 | - | - |
| Collab (14 layers) | 3.66 / 9.60 | - | - |

On LibriSpeech, the full-size DET encoder trained with collaborative learning reduces WER on the "other" subset by over 8% relative to the same-size baseline (9.86 to 9.04). The 14-layer encoder trained collaboratively corresponds to a roughly 25% model size reduction (58M vs. 77M parameters for the baselines of the same depths) with WER similar to the full-size baseline (3.66 / 9.60 vs. 3.62 / 9.86). Analogous trends are observed on in-house data, with shallow encoders providing improved real-time factor (RTF) and startup processing latency (SPL).
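The relative figures quoted above follow directly from the table values, as this short calculation shows. The helper name is illustrative; the numbers are taken from the table.

```python
def relative_reduction(baseline, value):
    """Fractional reduction of `value` with respect to `baseline`."""
    return (baseline - value) / baseline

# WER on the "other" subset: 20-layer baseline vs. 20-layer collaborative model.
wer_other_gain = relative_reduction(9.86, 9.04)   # about 0.083
# Parameter count: 20-layer baseline vs. 14-layer baseline.
size_gain = relative_reduction(77e6, 58e6)        # about 0.247
# RTF: 20-layer baseline vs. retrained 14-layer baseline.
rtf_gain = relative_reduction(0.63, 0.47)         # about 0.254
```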

5.2. Dynamic-Encoder Decoding (Mixed Depths)

Dynamic assignment, switching after 0.8 s on LibriSpeech, yielded WERs of 3.84 / 9.49 (layer-drop) and 3.62 / 9.27 (collab), with RTFs of 0.60 and 0.56, respectively. Compared to the 20-layer baseline (9.86 WER on "other", RTF 0.63), this represents a relative WER reduction of up to 6% on the "other" subset and an RTF reduction of up to 11%. Startup latency improvements exceeding 10% were achieved with negligible WER penalty. In-house evaluations confirm the effectiveness in reducing perceived latency and computational overhead.

6. Discussion and Implications for On-Device ASR

DET enables a range of accuracy–latency configurations by tuning either the encoder depth or the intra-utterance switching threshold. Collaborative learning limits the accuracy loss of the shallow encoders, while layer-drop allows encoder depth to be changed at deployment time without fine-tuning. DET's shared-weight architecture supports a single binary configured at runtime for high-end devices, mid-tier devices, or dynamically adaptive operation. Switching is memory efficient because it is implemented via layer skipping rather than model reloading.

Wake-word detection and low-latency interaction scenarios particularly benefit from DET, as early audio can be processed with reduced depth, deferring the full encoder until accurate wakeup is confirmed. DET surpasses naïve pruning and multi-model deployment baselines, delivering a unified solution to the trade-off between ASR accuracy and real-time constraints (Shi et al., 2021).
