Unified Diarization-ASR Models
- Unified diarization-ASR models are integrated architectures that jointly perform speaker diarization and speech recognition, streamlining the process into a single optimization problem.
- Architectural innovations such as shared encoders, multi-task heads, and serialized output training enable simultaneous prediction of speaker identity and spoken content.
- Joint training using composite loss functions and task-specific predictors yields significant improvements in diarization error rate and word error rate, even in challenging overlapping speech scenarios.
Unified end-to-end diarization-ASR models are neural architectures that jointly infer both "who spoke when" (diarization) and "what was said" (automatic speech recognition, ASR) from raw audio, treating them as a single machine learning problem. Unlike cascaded pipelines, where diarization and ASR are handled by separate models run in sequence, unified models optimize both outputs jointly, leveraging a shared audio representation for improved alignment, efficiency, and robustness, particularly in the presence of overlapping speech and conversational phenomena.
1. Architectural Principles and Model Variants
Unified diarization-ASR models are characterized by integrating representations, either through shared encoders, multi-task heads, or sequence-to-sequence formulations in which speaker and word labels are predicted simultaneously. The architectural taxonomy comprises:
- Multi-speaker encoders with dedicated heads for diarization, separation, and ASR, exploiting shared semantic hierarchies (e.g., UME architecture) (Shakeel et al., 28 Aug 2025).
- End-to-end sequence transducers, such as RNN-T or encoder-decoder models, where word and speaker predictions are aligned and often coupled via auxiliary or parallel network branches operating at the word, frame, or subword levels (Huang et al., 2023, Ghosh et al., 14 Jul 2025).
- Large multimodal LLMs (MLLMs), such as SpeakerLM (Yin et al., 8 Aug 2025), which concatenate audio, speaker, and text embeddings and predict interleaved speaker and transcript tokens autoregressively.
- Serialized output training (SOT) encoder-decoder models, which output ordered sequences of speaker-attributed transcripts covering all utterances in a window or conversation (Kanda et al., 2021, Cornell et al., 2023, Mao et al., 2020).
Emerging models often support word-level, subword-level, or frame-level speaker attribution, with precise time-stamping as an auxiliary prediction for downstream alignment.
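As a concrete, deliberately minimal illustration of the shared-encoder, multi-task-head pattern described above, the following PyTorch sketch pairs one encoder with separate diarization and ASR heads. The layer sizes, speaker-slot count, and vocabulary size are placeholder assumptions, not any published configuration.

```python
import torch
import torch.nn as nn

class UnifiedDiarASR(nn.Module):
    """Toy unified model: one shared encoder feeding two task heads."""
    def __init__(self, n_mels=80, d_model=256, n_layers=4, max_speakers=3, vocab_size=1000):
        super().__init__()
        self.frontend = nn.Linear(n_mels, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.diar_head = nn.Linear(d_model, max_speakers)   # per-frame speaker activity
        self.asr_head = nn.Linear(d_model, vocab_size)      # per-frame token logits (CTC-style)

    def forward(self, feats):                                # feats: (batch, frames, n_mels)
        h = self.encoder(self.frontend(feats))               # shared representation
        return self.diar_head(h), self.asr_head(h)

model = UnifiedDiarASR()
diar_logits, asr_logits = model(torch.randn(2, 120, 80))
print(diar_logits.shape, asr_logits.shape)                   # (2, 120, 3) and (2, 120, 1000)
```

Because both heads read the same encoder output, gradients from the diarization and ASR objectives shape a single shared representation, which is the core of the unified formulation.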
2. Information Integration and Layerwise Encoding
Unified models exploit hierarchical representations in shared encoders to provide semantically distinct information for each task. For example, bottom encoder layers emphasize source characteristics (e.g., speaker identity), while higher layers encode lexical/phonetic content. Mechanisms for information extraction include:
- Residual Weighted-Sum Encoding (RWSE): For each task head, a softmax-normalized vector determines the contribution from each shared encoder layer. The sum is then "residualized" by adding the final layer's output, ensuring access to both intermediate and high-level abstractions. Each task thus automatically learns to emphasize the encoder depths most beneficial for diarization, separation, or ASR (Shakeel et al., 28 Aug 2025).
- Task-specific predictors: Separate predictors for diarization and ASR are used if their contexts differ. For speaker-role diarization (RD), an LSTM with full history provides superior role prediction, whereas a CNN with two-token context suffices for word prediction (Ghosh et al., 14 Jul 2025).
This stratified utilization of the encoder's semantic hierarchy has been shown to stabilize training, prevent divergence in multitask settings, and maximize performance gains from multi-head optimization.
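The RWSE mechanism can be sketched directly from the description above: per-task softmax weights over the shared encoder's layer outputs, plus a residual connection from the final layer. The layer count, tensor shapes, and per-task instantiation below are assumptions rather than the exact UME implementation.

```python
import torch
import torch.nn as nn

class RWSE(nn.Module):
    """Softmax-weighted sum over encoder layer outputs plus a final-layer residual."""
    def __init__(self, n_layers: int):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(n_layers))  # one learnable scalar per layer

    def forward(self, layer_outputs):                 # list of (batch, frames, d_model) tensors
        stacked = torch.stack(layer_outputs, dim=0)   # (n_layers, B, T, D)
        w = torch.softmax(self.layer_logits, dim=0)   # (n_layers,)
        weighted_sum = (w[:, None, None, None] * stacked).sum(dim=0)
        return weighted_sum + layer_outputs[-1]       # "residualized" with the final layer

# Each task head owns its own RWSE over the same shared encoder outputs.
layers = [torch.randn(2, 120, 256) for _ in range(6)]
diar_input = RWSE(6)(layers)   # diarization head learns its own layer mixture
asr_input = RWSE(6)(layers)    # ASR head learns a different mixture
```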
3. Joint Training and Multi-Objective Optimization
Unified end-to-end models typically optimize a composite loss function that linearly combines task-specific losses, enabling gradient flow across all outputs and fostering synergy between tasks:
- Diarization Loss: Framewise or word-aligned binary cross-entropy, or, for permutation-invariant prediction, a minimum over possible speaker-label assignments.
- ASR Loss: Connectionist Temporal Classification (CTC), attention, or sequence cross-entropy loss (with or without SOT).
- Separation Loss: (Where explicit source separation is included as in UME) scale-invariant signal-to-noise ratio (SI-SNR) loss with permutation alignment.
A canonical example is the UME model, where the composite loss is a weighted sum of the diarization, separation, and ASR losses with near-equal weights across tasks, improving robustness and error consistency across diarization, separation, and ASR (Shakeel et al., 28 Aug 2025).
In language-model-based systems such as SpeakerLM, a single cross-entropy objective is used for the interleaved sequence of text and speaker tokens, subsuming the diarization and recognition tasks (Yin et al., 8 Aug 2025).
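For concreteness, here is a hedged sketch of the composite, multi-loss variant: a permutation-invariant frame-level BCE for diarization combined with a CTC loss for ASR under near-equal weights. The weight values, head shapes, and the omission of the separation term are simplifying assumptions, not the exact UME or SpeakerLM recipe.

```python
from itertools import permutations

import torch
import torch.nn.functional as F

def pit_diar_loss(logits, targets):
    """Frame-level BCE minimized over speaker-label permutations (PIT).
    logits, targets: (batch, frames, n_speakers)."""
    n_spk = logits.shape[-1]
    per_perm = []
    for perm in permutations(range(n_spk)):
        permuted = targets[..., list(perm)]
        bce = F.binary_cross_entropy_with_logits(logits, permuted, reduction="none")
        per_perm.append(bce.mean(dim=(1, 2)))            # one loss per batch item
    return torch.stack(per_perm, dim=0).min(dim=0).values.mean()

def composite_loss(diar_logits, diar_targets, asr_logits, tokens,
                   frame_lens, token_lens, w_diar=1.0, w_asr=1.0):
    """Near-equal weighted sum of the diarization and ASR (CTC) losses."""
    l_diar = pit_diar_loss(diar_logits, diar_targets)
    log_probs = asr_logits.log_softmax(-1).transpose(0, 1)   # CTC expects (frames, batch, vocab)
    l_asr = F.ctc_loss(log_probs, tokens, frame_lens, token_lens, blank=0)
    return w_diar * l_diar + w_asr * l_asr

# Dummy usage with random head outputs (shapes match the earlier sketch).
B, T, S, V = 2, 120, 3, 1000
loss = composite_loss(
    torch.randn(B, T, S, requires_grad=True),            # diarization logits
    torch.randint(0, 2, (B, T, S)).float(),              # frame-level speaker activity
    torch.randn(B, T, V, requires_grad=True),            # ASR logits
    torch.randint(1, V, (B, 20)),                        # token ids (0 = CTC blank)
    torch.full((B,), T), torch.full((B,), 20),
)
loss.backward()
```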
4. Data Handling, Pretraining, and Inference
Unified diarization-ASR models require diverse and realistically mixed data to generalize across speaker counts, overlap ratios, and noise environments. Strategies include:
- Large-scale pretraining on single-speaker ASR and speaker recognition tasks to stabilize encoder representations.
- Synthetic and real multi-speaker mixtures for joint ASR/diarization training (e.g., LibriMix, LibriSpeechMix, AliMeeting, AISHELL), with controlled overlap and noise injection (Yin et al., 8 Aug 2025, Shakeel et al., 28 Aug 2025); a toy mixing routine is sketched after this list.
- Curriculum learning and staged fine-tuning: progressively adapting modules (e.g., audio encoder, speaker projector, LLM adapters) to speaker diarization and recognition (SDR) tasks via staged joint training (Yin et al., 8 Aug 2025).
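To make the overlap-controlled mixing concrete, the following toy two-speaker mixing routine also produces frame-level speaker-activity targets. It is a hypothetical simplification of the protocols used to build the corpora above; the sample rate, SNR, and frame hop are assumed values.

```python
import numpy as np

def mix_two_speakers(utt_a, utt_b, sr=16000, overlap_ratio=0.3, noise_snr_db=20.0, rng=None):
    """Offset utt_b so that `overlap_ratio` of the shorter utterance overlaps utt_a,
    add Gaussian noise at the requested SNR, and return frame-level speaker labels."""
    rng = rng if rng is not None else np.random.default_rng()
    overlap = int(min(len(utt_a), len(utt_b)) * overlap_ratio)
    offset = len(utt_a) - overlap                       # sample index where utt_b starts
    total = max(len(utt_a), offset + len(utt_b))
    mix = np.zeros(total, dtype=np.float32)
    mix[:len(utt_a)] += utt_a
    mix[offset:offset + len(utt_b)] += utt_b
    # Additive noise at a fixed SNR relative to the mixture power.
    sig_pow = float(np.mean(mix ** 2)) + 1e-12
    noise_pow = sig_pow / (10 ** (noise_snr_db / 10))
    mix = mix + rng.normal(0.0, np.sqrt(noise_pow), size=total).astype(np.float32)
    # Frame-level speaker-activity targets for a diarization head (10 ms hop).
    hop = sr // 100
    labels = np.zeros((total // hop, 2), dtype=np.float32)
    labels[: len(utt_a) // hop, 0] = 1.0
    labels[offset // hop : (offset + len(utt_b)) // hop, 1] = 1.0
    return mix, labels

# Example: ~50% overlap between a 3 s and a 2 s synthetic utterance.
a = (0.1 * np.random.randn(3 * 16000)).astype(np.float32)
b = (0.1 * np.random.randn(2 * 16000)).astype(np.float32)
mixture, frame_labels = mix_two_speakers(a, b, overlap_ratio=0.5)
```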
Inference strategies are tied to architectural specifics:
- Permutation Invariant Training (PIT) is used to resolve label ambiguity inherent in multi-speaker outputs.
- Sliding or striding window decoding partitions long-form audio into manageable windows, crucial for memory efficiency in hour-long conversations (Mao et al., 2020, Cornell et al., 2023).
- Speaker clustering: For models supporting arbitrary numbers of speakers, embeddings extracted per window are globally clustered and speaker tags relabeled post-hoc (Kanda et al., 2021, Cornell et al., 2023).
- In MLLM and SOT systems, absolute timestamps or segment boundaries are often handled via explicit token prediction or through dedicated timing heads.
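The sliding-window plus global-clustering strategy can be outlined as follows. Here, decode_window is a hypothetical stand-in for any unified model's per-window inference (returning speaker embeddings and transcripts with window-local identities), clustering uses scikit-learn's AgglomerativeClustering, and the window/hop lengths and distance threshold are assumed values.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering  # `metric=` requires scikit-learn >= 1.2

def long_form_decode(audio, sr, decode_window, win_s=30.0, hop_s=25.0, n_speakers=None):
    """Slide a window over `audio`; decode_window(chunk) is assumed to return a list of
    (speaker_embedding, transcript) pairs with window-local speaker identities."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    embeddings, records = [], []
    for start in range(0, max(1, len(audio) - win + 1), hop):
        for emb, text in decode_window(audio[start:start + win]):
            embeddings.append(emb)
            records.append((start / sr, text))           # keep the window start time
    # Globally cluster per-window embeddings so window-local labels become consistent.
    clusterer = AgglomerativeClustering(
        n_clusters=n_speakers,
        distance_threshold=None if n_speakers else 1.0,  # threshold mode if count is unknown
        metric="cosine", linkage="average",
    )
    global_ids = clusterer.fit_predict(np.stack(embeddings))
    return [(t, f"spk{gid}", text) for (t, text), gid in zip(records, global_ids)]
```

Passing n_speakers=None falls back to a distance threshold, which mirrors how such systems handle an unknown number of participants; the relabeled speaker tags then hold across the entire recording.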
5. Empirical Performance and Task Interdependency
Unified models outperform classic cascaded pipelines and isolated single-task baselines across overlapping, noisy, and multi-speaker conditions. Key findings:
- Diarization: UME achieves state-of-the-art DER of 1.37% and 2.29% on Libri2Mix and Libri3Mix (0 s collar), improving upon EEND/ConvTasNet baselines by more than 2x (Shakeel et al., 28 Aug 2025). SpeakerLM yields cpCER of 16.05 (AliMeeting-Eval) vs. 23.20 for the cascaded SOTA (Yin et al., 8 Aug 2025).
- ASR: Unified models reduce WER relative to pipeline approaches: UME achieves WER = 6.4% on Libri2Mix (clean) vs. 12.7% single-task (Shakeel et al., 28 Aug 2025); SLIDAR attains cpWER = 15.6 (AMI eval) (Cornell et al., 2023).
- Separation: UME's inclusion of a separation head yields SI-SNR/SDR/STOI improvements even when directly compared to ConvTasNet baselines.
Ablations reveal that removing joint optimization, dropping the weighted-sum encodings, or training heads in isolation leads to substantial performance drops or unstable convergence, especially in three-speaker settings. Error propagation between diarization and ASR is substantially reduced in unified architectures, since each head directly exploits the inductive biases and cleaned-up features computed by the other heads (Shakeel et al., 28 Aug 2025, Yin et al., 8 Aug 2025).
6. Extensions, Limitations, and Future Perspectives
Contemporary unified diarization-ASR models exhibit several axes of extensibility and constraints:
- Speaker registration and adaptation: MLLM approaches, such as SpeakerLM, flexibly accommodate arbitrary numbers of registered or unregistered speakers without requiring architectural change (Yin et al., 8 Aug 2025). Over-registration incurs minimal accuracy degradation.
- Scalability and data efficiency: Data scaling (200 h → 7.6k h) cut cpCER nearly in half, with only marginal impact on error for out-of-domain or highly overlapping data (Yin et al., 8 Aug 2025).
- Arbitrary speaker counts: Attention-based models (e.g., Transcribe-to-Diarize) and SOT-based systems (SLIDAR) naturally generalize to an unknown or unlimited number of participants (Kanda et al., 2021, Cornell et al., 2023).
- Long-form and conversational phenomena: Sliding-window inference and context concatenation enable processing of hour-long meetings and real-world conversations, but global speaker consistency depends on robust embedding clustering and alignment procedures (Cornell et al., 2023).
- Model limitations: Computational overhead remains significant, particularly for LLM/MLLM decoders and iterative global alignment. Some models require maximal speaker counts or prior knowledge, and performance in high-overlap or rapid turn-taking scenarios still degrades relative to oracle conditions (Saengthong et al., 26 Jun 2025, Shakeel et al., 28 Aug 2025).
- Open directions: Differentiable diarization modules, extension to more than two speakers in LLMs, serializing overlapping regions, and universal cross-task timestamp supervision are cited as potential improvements.
In conclusion, unified end-to-end diarization-ASR models have redefined the practical and algorithmic landscape of "who spoke when and what," blending architectural innovation with cross-task optimization to achieve strong empirical gains and operational simplicity over cascaded systems.