Joint RNN-T ASR/SD Models

Updated 2 June 2026

RNN-T Joint ASR/SD models are unified architectures that combine automatic speech recognition with speaker diarization and role labeling for streaming applications.
They employ parallel branches with shared encoders and specialized predictors to synchronize subword recognition and speaker roles using forced alignment strategies.
Diarization-driven decoding and overlap-based target arrangements enhance performance by reducing reliance on modular pipelines and improving short word recovery.

Recurrent Neural Network Transducer (RNN-T)–based models serve as a backbone for modern streaming automatic speech recognition (ASR). Recently, variants of RNN-T have been developed to jointly address ASR and speaker-related tasks—principally speaker diarization (SD, "who spoke when") and the finer-grained speaker-role diarization (RD, e.g., assigning utterances to roles like doctor/patient or host/guest). These joint architectures extend the RNN-T’s capabilities from subword recognition to "who spoke what," integrating the diarization signal at the word or subword level and enabling joint training or tightly-coupled inference. Key advances include architectural synchronization, forced alignment strategies, specialized predictor designs, and decoding methods leveraging diarization posteriors to improve short word detection. These models achieve strong performance on benchmarks, reduce reliance on modular pipelines, and inspire new decoding strategies, but introduce new trade-offs and areas for optimization (Ghosh et al., 14 Jul 2025, Sklyar et al., 2021).

1. Joint RNN-T Architectures for ASR and Diarization

Joint RNN-T ASR/SD models are constructed as networks with parallel branches, each executing an RNN-T-style transduction: the ASR branch for subword sequence prediction, and the diarization branch for per-word speaker/role labeling. Architecturally, these branches share or partially share encoders but rely on distinct predictor and joiner modules:

ASR Branch: Encodes acoustic features with convolutional subsamplers and e-branchformer layers; the prediction network (e.g., 1D-CNN with a 2-token receptive field) models linguistic context; the joiner produces logits over the subword vocabulary plus blank (Ghosh et al., 14 Jul 2025).
Diarization Branch: Takes its encoder input from an intermediate ASR encoder layer, preserving full frame rate and emitting role or speaker categories; its predictor is a unidirectional RNN (typically LSTM) consuming role history; its joiner outputs over a role inventory.
During both training and decoding, these branches are synchronized to align predictions and assign role or speaker labels to recognized words.

Multi-speaker or multi-turn variants employ outputs structured as parallel channels per speaker, with arrangements to assign overlapping or sequential segments to appropriate channels, and use special tokens to mark segmentation or speaker turns (Sklyar et al., 2021).

2. Training Paradigms and Loss Functions

Training jointly for ASR and diarization presents unique challenges, notably in loss definition and temporal alignment:

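ASR RNN-T Loss: The standard RNN-T negative log-likelihood marginalizes over all valid alignments $a \in \mathcal{A}_y$ that map the transcript $y$ onto a lattice of time steps $t$ and output steps $u$. Here $f^{\text{ASR}}$ and $g^{\text{ASR}}$ denote the ASR encoder and predictor outputs, respectively. The blank symbol φ is handled using HAT-style factorization.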

$\mathcal{L}_{\text{ASR}} = -\ln P(y|x) = -\ln \sum_{a \in \mathcal{A}_y} \prod_{(t, u) \in a} P(y_{t,u} \mid f_{1:T}^{\text{ASR}}(x), g_{1:u-1}^{\text{ASR}})$

Role Diarization Loss: Rather than summing over all alignment paths, forced alignment is used: the best ASR path determines word positions, and role labels are assigned at these positions only. The RD loss reduces to cross-entropy over these forced-alignment word positions, where $r_{t,u} = -$ marks lattice positions that carry no role label and are therefore excluded from the sum:

$\mathcal{L}_{\text{RD}} = \sum_{(t,u): r_{t,u} \neq -} -\ln P(r_{t,u} | f_{1:T}^{\text{RD}}, g_{1:u-1}^{\text{RD}})$

This strategy avoids computationally expensive blank-sharing or a full RNN-T loss computation for the RD branch, and focuses optimization on the positions where role information is meaningful and aligned with the data (Ghosh et al., 14 Jul 2025).
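As a rough illustration of the forced-alignment RD loss, the sketch below sums the negative log-probability of the reference role at labeled lattice positions only. The data layout (a dictionary keyed by `(t, u)`, and `None` standing in for the "no label" symbol) and the function name are assumptions made for this example, not the authors' implementation.

```python
import math

def rd_forced_alignment_loss(role_posteriors, forced_alignment):
    """Sum of -ln P(role) over lattice positions that carry a role label.

    role_posteriors: dict mapping a lattice position (t, u) to a list of
        role probabilities from the RD joiner (hypothetical layout).
    forced_alignment: list of ((t, u), role_index) pairs taken from the best
        ASR path; role_index is None where no role label applies ("-").
    """
    loss = 0.0
    for position, role_index in forced_alignment:
        if role_index is None:
            continue  # unlabeled position: contributes nothing to the loss
        loss += -math.log(role_posteriors[position][role_index])
    return loss

# Toy example with two roles (0 = doctor, 1 = patient).
posteriors = {
    (0, 0): [0.9, 0.1],
    (1, 1): [0.5, 0.5],
    (2, 1): [0.2, 0.8],
}
alignment = [((0, 0), 0), ((1, 1), None), ((2, 1), 1)]
loss = rd_forced_alignment_loss(posteriors, alignment)
```

Only the two labeled positions contribute, so no sum over alignment paths is needed for the RD branch.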

In multi-turn multi-speaker variants, the RNN-T loss is instantiated per channel, with specialized target arrangements (e.g., overlap-based, permutation-invariant) to map reference utterances to model outputs (Sklyar et al., 2021).

3. Predictor Designs and Contextual Requirements

Empirical analysis demonstrates that ASR and diarization tasks require different levels and types of context:

ASR Predictors: Word/subword prediction is nearly optimal with a short (2-token) lookback; a CNN-2 suffices. This supports the use of lightweight, high-throughput predictors for ASR.
RD Predictors: Role/speaker prediction benefits from extended linguistic history; a unidirectional LSTM is superior, as longer context reduces diarization error rates.

The result is a bifurcated model design: the ASR predictor (CNN-2) is frozen after ASR training, while the RD predictor (LSTM) is trained from scratch specifically for the role prediction task (Ghosh et al., 14 Jul 2025). Multi-turn models also use LSTM-based predictors per output channel to model sequential dependencies (Sklyar et al., 2021).
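To make the "2-token receptive field" concrete, here is a deliberately simplified sketch of a predictor whose output depends only on the last two emitted tokens. The scalar tap weights `w_prev` and `w_prev2`, the toy embedding table, and the padding convention are all assumptions for illustration; a real CNN-2 predictor would use learned convolution kernels.

```python
def cnn2_predictor(history, embeddings, w_prev, w_prev2, pad_id=0):
    """Predictor state computed from the last two emitted tokens only."""
    # Left-pad so that histories shorter than two tokens are still valid.
    padded = [pad_id, pad_id] + list(history)
    prev, prev2 = padded[-1], padded[-2]
    # A kernel-size-2 convolution reduced to one scalar weight per tap.
    return [w_prev * a + w_prev2 * b
            for a, b in zip(embeddings[prev], embeddings[prev2])]

embeddings = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
short_history = cnn2_predictor([1, 2], embeddings, w_prev=0.7, w_prev2=0.3)
long_history = cnn2_predictor([3, 3, 3, 1, 2], embeddings, w_prev=0.7, w_prev2=0.3)
swapped = cnn2_predictor([2, 1], embeddings, w_prev=0.7, w_prev2=0.3)
```

Both histories ending in tokens `1, 2` yield the same state, which is the property that makes such a predictor cheap; a recurrent RD predictor, by contrast, can carry information from earlier tokens.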

4. Diarization-Driven Decoding Strategies

Even when the RD branch is trained sparsely (i.e., only at word positions), its posterior activity can meaningfully inform ASR decoding—in particular, by identifying likely active speakers and providing a confidence signal:

Blank Suppression for Short Word Recovery: Many deletions in ASR occur on short, ambiguous words. RD posteriors are used to modulate the ASR output distribution: when a frequently deleted word candidate is highly probable and the diarization branch is confident about a role at that step, the ASR blank posterior is suppressed and its mass is redistributed onto subword tokens. This heuristic reduces deletions of short words at the cost of only a minor increase in insertions (Ghosh et al., 14 Jul 2025).
Synchronized Beam Search: During decoding, standard RNN-T beam search is performed on the ASR branch, while, at every non-blank emission, the corresponding role label is attached from the RD branch (using argmax over the joiner’s output at that (t, u-1) point). This maintains the "who spoke what" mapping in lock-step (Ghosh et al., 14 Jul 2025).
Segmentation via Special Tokens: In multi-party and multi-turn models, a dedicated segmentation token (e.g., <cot>) is added to the vocabulary and emitted inline with recognized wordpieces, allowing streaming output of both transcript and turn/segment boundaries (Sklyar et al., 2021).
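The first two strategies above can be sketched as follows. This is a minimal illustration under stated assumptions: the thresholds `token_threshold` and `role_threshold`, the `keep_fraction` parameter, the proportional redistribution rule, and both function names are hypothetical and are not the tuned quantities or exact rule from the cited work.

```python
def suppress_blank(asr_probs, role_probs, deletable_ids, blank_id=0,
                   token_threshold=0.2, role_threshold=0.8, keep_fraction=0.5):
    """Move part of the blank mass onto subword tokens when both gates fire."""
    nonblank = [i for i in range(len(asr_probs)) if i != blank_id]
    best = max(nonblank, key=lambda i: asr_probs[i])
    # Gate 1: the top non-blank token is a frequently deleted word candidate.
    candidate_likely = best in deletable_ids and asr_probs[best] >= token_threshold
    # Gate 2: the RD branch is confident about some role at this step.
    role_confident = max(role_probs) >= role_threshold
    nonblank_mass = sum(asr_probs[i] for i in nonblank)
    if not (candidate_likely and role_confident) or nonblank_mass <= 0.0:
        return list(asr_probs)
    moved = asr_probs[blank_id] * (1.0 - keep_fraction)
    # Redistribute proportionally to the existing non-blank probabilities.
    adjusted = [p + moved * p / nonblank_mass for p in asr_probs]
    adjusted[blank_id] = asr_probs[blank_id] - moved
    return adjusted


def attach_roles(emissions, role_posteriors):
    """Pair each non-blank emission with the argmax role at its lattice node.

    emissions: list of (t, u_prev, token), where u_prev counts the tokens
        emitted before this one, i.e. the node (t, u - 1).
    """
    labeled = []
    for t, u_prev, token in emissions:
        probs = role_posteriors[(t, u_prev)]
        labeled.append((token, max(range(len(probs)), key=lambda r: probs[r])))
    return labeled


adjusted = suppress_blank([0.6, 0.3, 0.1], [0.9, 0.1], deletable_ids={1})
unchanged = suppress_blank([0.6, 0.3, 0.1], [0.55, 0.45], deletable_ids={1})
labeled = attach_roles([(0, 0, "uh"), (3, 1, "yes")],
                       {(0, 0): [0.9, 0.1], (3, 1): [0.2, 0.8]})
```

In the first call both gates fire, so half of the blank mass moves to the subword tokens; in the second the role posterior is not confident enough and the distribution is returned unchanged.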

5. Target Arrangement and Scalability in Multi-Speaker Streaming

Scaling joint RNN-T models to multi-speaker, multi-turn scenarios demands efficient target/channel assignment without factorial computational blowup:

Overlap-Based Target Arrangement: Instead of assigning one output channel per speaker and using permutation-invariant training (PIT), overlap-based assignment partitions speech segments across a fixed set of channels by greedily switching channels on overlap events. This O(N·U) complexity approach accommodates an arbitrary number of speaker turns with a fixed-size model (Sklyar et al., 2021).
Rich Transcription Strategy: Joint recognition and segmentation is enabled by augmenting channel outputs with segmentation tokens. The model can be trained and evaluated for both word recognition and turn segmentation in a streaming, low-latency fashion (Sklyar et al., 2021).

This approach allows the architecture to generalize beyond the maximum number of speakers seen during training, as demonstrated by the relative word error rate improvements (up to 28%) in multi-turn test conditions.
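A minimal sketch of overlap-based target arrangement, assuming utterances are given as `(start, end, text)` tuples: the arrangement stays on the current channel until an utterance overlaps it, then switches to a free channel. The function name, the error raised when every channel is busy, and the placement of the `<cot>` token after each utterance are assumptions for illustration rather than details confirmed by the cited work.

```python
def arrange_targets(utterances, num_channels=2, turn_token="<cot>"):
    """Greedily assign utterances to a fixed set of output channels."""
    channels = [[] for _ in range(num_channels)]
    busy_until = [float("-inf")] * num_channels
    current = 0
    for start, end, text in sorted(utterances):
        if start < busy_until[current]:
            # Overlap with the current channel: switch to a free one.
            free = [c for c in range(num_channels) if busy_until[c] <= start]
            if not free:
                raise ValueError("more simultaneous speakers than channels")
            current = free[0]
        # Assumed convention: close every utterance with the turn token.
        channels[current].extend(text.split() + [turn_token])
        busy_until[current] = max(busy_until[current], end)
    return channels


utterances = [
    (0.0, 2.0, "hello there"),
    (1.5, 3.0, "hi"),
    (3.5, 5.0, "how are you"),
    (4.0, 6.0, "fine"),
]
channels = arrange_targets(utterances)
```

Each utterance is handled once with a scan over the fixed channel set, so the cost grows linearly with the number of turns and no permutation search over speaker-to-channel assignments is required.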

6. Performance Benchmarks and Key Empirical Findings

Unified RNN-T ASR/SD models have been empirically evaluated in several data regimes and with multiple metrics:

| Model Configuration | WER (%) | R-WDER (%) | Notes |
|---|---|---|---|
| ASR(CNN-2) alone (Ghosh et al., 14 Jul 2025) | 15.67 | — | Baseline ASR |
| Role-ASR(RNN) (Ghosh et al., 14 Jul 2025) | 16.23 | 7.6 | Unified predictor; WER suffers |
| ASR(CNN-2)+RD(RNN) (Ghosh et al., 14 Jul 2025) | 15.67 | 7.1 | Separate predictors; best balance |
| + RD-guided decoding (Ghosh et al., 14 Jul 2025) | 15.65 | — | WER reduction via blank suppression |
| PIT-MS-RNN-T with on-the-fly simulation (Sklyar et al., 2021) | 8.8 | — | 2-speaker, streaming SOTA on LibriSpeechMix |
| MT-RNN-T S_max=5 (Sklyar et al., 2021) | 23.6 | — | 5-speaker LibriCSS, overlap-based assignment |

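ASR(CNN-2)+RD(RNN) achieves the best observed trade-off between ASR accuracy and role-aware diarization error: it retains the baseline WER of 15.67% with an R-WDER of 7.1%, whereas the unified Role-ASR(RNN) predictor reaches 16.23% WER and 7.6% R-WDER. Diarization-driven decoding further reduces the deletion rate by a small but significant absolute margin (0.09%), with overall WER moving from 15.67% to 15.65%.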

MT-RNN-T architectures employing overlap-based assignment scale efficiently and outperform traditional PIT approaches as the number of speakers/turns increases, though segmentation accuracy degrades when tested on more turns than seen in training.

7. Limitations, Open Issues, and Future Directions

Several limitations remain in joint RNN-T ASR/SD models:

Heuristic Sensitivity: Blank suppression and deletion recovery depend on per-token threshold tuning (sets D_n, α, β); generalization beyond validation domains is not guaranteed (Ghosh et al., 14 Jul 2025).
Segmentation Recall: Multi-turn models systematically under-segment when the test scenario contains more speaker turns than the training regime. This suggests a limitation in role/turn generalization and highlights the importance of diverse, high-turn-count training data (Sklyar et al., 2021).
Out-of-Domain Robustness: Diarization posteriors, while robust on in-domain data, may not reliably indicate speaker activity or role in different domains, risking increased insertions under the guided decoding heuristic (Ghosh et al., 14 Jul 2025).

Directions for further research include self-supervised or more robust blank suppression mechanisms, improved target arrangements for segment generalization, and unified architectures adaptive to an open set of speaker roles or unknown numbers of speakers. Advances in RNN-T joint module design, such as the integration of gating and bilinear pooling for improved fusion of audio and linguistic representations, are also directly applicable (Zhang et al., 2022).