Speaker Diarization and Recognition (SDR)

Updated 16 November 2025
  • Speaker Diarization and Recognition (SDR) is a process that segments continuous audio streams and assigns speaker identities, enabling clear multi-party analysis.
  • Traditional SDR systems use modular pipelines—incorporating VAD, segmentation, embedding extraction, and clustering—while newer methods integrate end-to-end, joint separation, and reinforcement learning approaches.
  • Modern SDR frameworks blend spatial, acoustic, lexical, and semantic cues to overcome challenges like overlapping speech, noise, and scalability in diverse real-world environments.

Speaker diarization and recognition (SDR) refers to the process of determining “who spoke when” and, in cases where a speaker inventory is available, “who said what.” SDR is operationally central to numerous multi-party audio processing tasks—including meeting transcription, broadcast analytics, diarized speech recognition, and conversational analytics. The SDR task combines accurate speaker segmentation (diarization), the attribution of segments to speaker identities (recognition), and often, temporal alignment with downstream modalities such as ASR or multimodal input.

1. Problem Formulation and Evaluation Metrics

At its core, speaker diarization partitions a continuous audio stream into temporally contiguous (potentially overlapping) segments, each assigned a speaker label, with the number of speakers unknown in advance. Let $x(t)$ denote the input audio and $\{S_i\}$ the set of hypothesized speakers. The diarization output is a sequence of tuples $\{(t_{\mathrm{start}}^k, t_{\mathrm{end}}^k, s^k)\}$, where the $k$-th segment belongs to speaker $s^k$.

Speaker recognition (“identification” in the context of SDR) attempts to associate these speaker labels either with anonymous, session-local identities or, if registered audio is available (i.e. enrollment utterances), with external speaker identities.
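
For concreteness, the tuple output above can be represented directly and rasterized into per-frame labels for scoring; the following is a minimal sketch, with an illustrative 10 ms frame shift and the simplifying assumption of at most one active speaker per frame (overlapping segments would need a multi-label representation):

```python
import numpy as np

# Hypothesized segments as (t_start, t_end, speaker) tuples, times in seconds
segments = [(0.0, 2.4, "spk1"), (2.4, 5.1, "spk2"), (5.1, 6.0, "spk1")]

def segments_to_frames(segments, total_dur, frame_shift=0.01):
    """Rasterize segment tuples into per-frame integer labels (0 = non-speech).

    Assumes at most one active speaker per frame; overlapping segments would
    require a multi-label representation instead.
    """
    n_frames = int(round(total_dur / frame_shift))
    labels = np.zeros(n_frames, dtype=int)
    spk_ids = {}  # speaker name -> integer label
    for t_start, t_end, spk in segments:
        idx = spk_ids.setdefault(spk, len(spk_ids) + 1)
        labels[int(t_start / frame_shift):int(t_end / frame_shift)] = idx
    return labels

frames = segments_to_frames(segments, total_dur=6.0)
print(frames[235:245])  # labels around the 2.4 s boundary: spk1 -> spk2
```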

System outputs are evaluated with composite error metrics:

  • Diarization Error Rate (DER):

$$\mathrm{DER} = \frac{T_{\mathrm{miss}} + T_{\mathrm{FA}} + T_{\mathrm{conf}}}{T_{\mathrm{ref}}}$$

where $T_{\mathrm{miss}}$ is the total missed speech time, $T_{\mathrm{FA}}$ is the false-alarm duration, $T_{\mathrm{conf}}$ is the erroneous cross-labeling (speaker-confusion) time, and $T_{\mathrm{ref}}$ is the total reference speaker time (a frame-level scoring sketch follows this list).

  • Speaker-Attributed WER (cpWER/saCER): Standard “concatenated minimum-permutation WER” (cpWER) aligns system and reference transcripts using the optimal mapping over hypothesized and reference speaker clusters. Speaker-attribution errors (saCER, $\Delta$cp) directly quantify the impact of speaker mislabeling on recognition.
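
A frame-level sketch of the DER computation is shown below, assuming a single active speaker per frame and hypothesis labels already mapped onto reference labels via the usual optimal permutation; production scorers additionally handle overlapping speech and forgiveness collars.

```python
import numpy as np

def frame_level_der(ref: np.ndarray, hyp: np.ndarray) -> float:
    """Frame-level DER, assuming hypothesis speaker labels are already
    mapped onto reference labels (0 denotes non-speech).

    ref, hyp: integer arrays of per-frame speaker labels of equal length.
    """
    ref_speech = ref != 0
    hyp_speech = hyp != 0

    t_miss = np.sum(ref_speech & ~hyp_speech)                 # speech labelled as silence
    t_fa = np.sum(~ref_speech & hyp_speech)                   # silence labelled as speech
    t_conf = np.sum(ref_speech & hyp_speech & (ref != hyp))   # wrong speaker label
    t_ref = np.sum(ref_speech)                                # total reference speech

    return (t_miss + t_fa + t_conf) / max(t_ref, 1)

# Example: 10 frames, two speakers (1 and 2), 0 = non-speech
ref = np.array([0, 1, 1, 1, 2, 2, 2, 0, 1, 1])
hyp = np.array([0, 1, 1, 2, 2, 2, 0, 0, 1, 2])
print(f"DER = {frame_level_der(ref, hyp):.2%}")  # (1 miss + 0 FA + 2 conf) / 8 = 37.50%
```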

2. System Architectures: Classical, Modular, and End-to-End SDR

2.1 Classical Modular Framework

Traditional SDR systems follow a modular “pipeline,” typically comprising the stages below (a compact code sketch follows the list):

  1. Preprocessing: VAD to reject non-speech.
  2. Segmentation: Uniform or change-point detection (e.g., BIC, GMM-CLR in the LIUM toolkit (Yılmaz et al., 2019)).
  3. Embedding Extraction: Either i-vectors or neural x-vectors, computed per segment (Yılmaz et al., 2019, Raj et al., 2020).
  4. Clustering: Agglomerative hierarchical (AHC), spectral clustering, or PLDA-based merging (Raj et al., 2020, Yılmaz et al., 2019, Park et al., 2023).
  5. (Optional) Speaker Linking/Identification: Matching segments to registered speaker models via cosine/PLDA scoring.
  6. Resegmentation/Postprocessing: Optionally, HMM/VB refinement or post hoc merging/splitting.
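
A compact sketch of stages 1–4 is given below. The clustering step uses scikit-learn's agglomerative clustering with an average-linkage cosine-distance threshold; `run_vad` and `embed_segment` are hypothetical stand-ins for the VAD and embedding-extraction components, and real systems additionally tune thresholds, apply PLDA scoring, and resegment.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def diarize(audio, sr, run_vad, embed_segment, win=1.5, hop=0.75, threshold=0.6):
    """Minimal modular diarization: VAD -> uniform segmentation -> embeddings -> AHC.

    run_vad(audio, sr)               -> list of (t_start, t_end) speech regions (hypothetical)
    embed_segment(audio, sr, t0, t1) -> 1-D speaker embedding, e.g. an x-vector (hypothetical)
    """
    # 1. VAD, then 2. uniform sub-segmentation within each speech region
    segments = []
    for t0, t1 in run_vad(audio, sr):
        t = t0
        while t < t1:
            segments.append((t, min(t + win, t1)))
            t += hop

    # 3. per-segment speaker embeddings, length-normalized for cosine distance
    emb = np.stack([embed_segment(audio, sr, a, b) for a, b in segments])
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)

    # 4. agglomerative clustering with a cosine-distance stopping threshold
    ahc = AgglomerativeClustering(n_clusters=None, metric="cosine",
                                  linkage="average", distance_threshold=threshold)
    labels = ahc.fit_predict(emb)

    return [(a, b, f"spk{l}") for (a, b), l in zip(segments, labels)]
```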

Large-scale applications (e.g., 3,000-h radio archives) demand stages that can scale memory and computation efficiently, leveraging parallel tape-level processing and sparse similarity score matrices (Yılmaz et al., 2019).

2.2 Joint Separation and Diarization

For multi-speaker or distant-mic scenarios, explicit source separation is often a prerequisite (Raj et al., 2020, Bando et al., 12 Jun 2024). This class includes:

  • Blind Source Separation with Neural Inference: e.g., multichannel neural FCA/FCASA (Bando et al., 12 Jun 2024), where a transformer encoder (RE-SepFormer with ISS modules) jointly learns separation and diarization heads. The diarization head outputs per-frame speaker activity probabilities $\eta_{n,t}$, and permutation-invariant training (PIT) aligns the output streams with reference speakers (a brute-force alignment sketch follows this list).
  • Continuous Speech Separation (CSS): As in Microsoft’s VoxSRC entry (Xiao et al., 2020), applying Conformer-block masking followed by diarization and system-level fusion (DOVER).
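
The permutation-invariant alignment used by such joint models can be illustrated with a brute-force search over the (small) number of output streams; the per-frame binary cross-entropy on speaker activities used here is an illustrative choice, not the exact objective of the cited systems.

```python
import itertools
import numpy as np

def pit_bce(pred_activity, ref_activity, eps=1e-8):
    """Permutation-invariant BCE between predicted and reference speaker activities.

    pred_activity: (N, T) per-frame activity probabilities eta_{n,t}
    ref_activity:  (N, T) binary reference activities
    Returns the best loss and the speaker permutation achieving it.
    """
    n = pred_activity.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(n)):
        p = np.clip(pred_activity[list(perm)], eps, 1 - eps)
        loss = -np.mean(ref_activity * np.log(p) + (1 - ref_activity) * np.log(1 - p))
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```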

2.3 End-to-End SDR and Multimodal LLM Approaches

Recent advances are represented by:

  • SpeakerLM: A multimodal LLM integrating audio tokenization (SenseVoice), speaker embedding cues (ERes2NetV2), and an LLM backbone (Qwen2.5) to autoregressively generate diarized, speaker-attributed transcripts. Its input format can flexibly encode both registered and unregistered speakers by prepending projected speaker cues and marker tokens (Yin et al., 8 Aug 2025).
  • SLIDAR (Sliding-window Diarization-Augmented Recognition): A sliding-window Transformer model produces sequence outputs including both transcripts and window-local speaker tags, then clusters the resulting embeddings globally (Cornell et al., 2023).
  • NSD-MS2S: Memory-aware multi-speaker embedding using a sequence-to-sequence (Seq2Seq) Transformer, with memory modules (e.g. DIM-enhanced MA-MSE) and feature-fusion in the decoder for frame-wise diarization (Yang et al., 2023).

3. Embedding Extraction and Speaker Linking

The critical element in modern SDR is the extraction of discriminative, segment-level speaker embeddings:

  • x-vectors/Res2Net/ECAPA-TDNN: Neural networks (typically TDNN, Res2Net, ECAPA-TDNN) are trained with additive or angular margin softmax losses, producing 128–512D embeddings for each segment (Xiao et al., 2020, Kim et al., 2021, Park et al., 2023).
  • PLDA Scoring: Pairwise PLDA log-likelihood ratios quantify embedding similarity, supporting clustering and cross-tape speaker linking (Yılmaz et al., 2019, Raj et al., 2020).
  • Cross-modal speaker cues: Some systems supplement embeddings with spatial/DOA (direction-of-arrival, (Zheng et al., 2021)), lexical (ASR-derived word embeddings, (Park et al., 2020)), or semantic features (BERT-based STD/DD, (Cheng et al., 2023)) for improved discrimination.
  • Enrollment Linking/Closed-set Identification: Registered speakers are matched via cosine or PLDA scores to segment embeddings; open-set recognition requires robust outlier thresholds and cluster creation (Morrone et al., 9 Sep 2024, Yin et al., 8 Aug 2025). A minimal scoring sketch follows this list.
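
Enrollment linking and open-set handling can be sketched as cosine scoring against enrolled embeddings with a rejection threshold; the threshold value and the creation of per-segment "unknown" labels (rather than clustering rejected segments, as a real system would) are simplifications for illustration.

```python
import numpy as np

def link_segments(segment_embs, enrolled, threshold=0.5):
    """Assign each segment embedding to an enrolled speaker or an open-set identity.

    segment_embs: (K, D) array of segment embeddings
    enrolled:     dict mapping speaker name -> (D,) enrollment embedding
    """
    names = list(enrolled)
    E = np.stack([enrolled[n] for n in names])
    E = E / np.linalg.norm(E, axis=1, keepdims=True)

    labels, n_unknown = [], 0
    for e in segment_embs:
        e = e / np.linalg.norm(e)
        scores = E @ e                      # cosine similarities to enrolled speakers
        best = int(np.argmax(scores))
        if scores[best] >= threshold:
            labels.append(names[best])      # closed-set match
        else:
            n_unknown += 1                  # open-set: treat as an unseen speaker
            labels.append(f"unknown{n_unknown}")
    return labels
```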

4. Clustering, Overlap Handling, and Postprocessing

4.1 Clustering

  • Spectral Clustering and AHC: Diarization pipelines employ spectral clustering on affinity matrices (cosine similarity, optionally lexical-augmented (Park et al., 2020)) or AHC with thresholds tuned via eigengap or a mixture-model fit (Park et al., 2023, Kim et al., 2021); a minimal spectral-clustering sketch follows this list.
  • VBx/VBHMM Refinement: Variational Bayes HMMs smooth cluster assignments, exploiting frame-wise temporal continuity (Park et al., 2023).
  • Ensemble Fusion: DOVER-Lap (an overlap-aware extension of Diarization Output Voting Error Reduction) combines multiple diarization hypotheses, partitioning audio into minimal regions and harmonizing labels via a Hungarian matching step (Xiao et al., 2020, Park et al., 2023).
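
A minimal sketch of the spectral-clustering route: a cosine affinity matrix over segment embeddings, eigengap-based speaker counting on the normalized graph Laplacian, and k-means in the spectral embedding. The affinity refinement and binarization steps used in practice are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_diarization(emb, max_speakers=8):
    """Cosine-affinity spectral clustering with eigengap-based speaker counting."""
    X = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    A = np.clip(X @ X.T, 0.0, 1.0)                  # cosine affinity in [0, 1]
    np.fill_diagonal(A, 0.0)

    # Normalized graph Laplacian L = I - D^{-1/2} A D^{-1/2}
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-10))
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

    eigvals, eigvecs = np.linalg.eigh(L)            # ascending eigenvalues
    max_speakers = min(max_speakers, len(A) - 1)
    gaps = np.diff(eigvals[:max_speakers + 1])
    n_speakers = int(np.argmax(gaps)) + 1           # largest eigengap

    V = eigvecs[:, :n_speakers]
    V = V / np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-10)
    return KMeans(n_clusters=n_speakers, n_init=10).fit_predict(V)
```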

4.2 Overlap and Noise

  • Overlap Speech Detection (OSD): Neural models detect overlapping speech segments, which are separated using time-domain or mask-based methods (ConvTasNet, SepFormer) (Kim et al., 2021).
  • Leakage Filtering: After separation, energy and embedding-based filtering removes spurious or low-confidence segments (Xiao et al., 2020, Kim et al., 2021).
  • Joint Acoustic-Lexical/Spatial Methods: Fusion of spatial cues (e.g., speaker DOA (Zheng et al., 2021)) or semantic turn-points (Cheng et al., 2023) addresses boundary uncertainty and reduces clustering/segmentation errors.

5. End-to-End and Reinforcement Learning Paradigms

End-to-end approaches aim to integrate all SDR stages into a unified architecture:

  • SpeakerLM and SLIDAR: Both generate speaker-attributed transcripts directly, leveraging audio tokens and speaker cues as context to the LLM decoder (Yin et al., 8 Aug 2025, Cornell et al., 2023). Performance improves monotonically with training data scale, showing state-of-the-art results on both in-domain and out-of-domain benchmarks.
  • Reinforcement Learning (RL): Online diarization can be framed as an RL/MDP or contextual bandit, with dynamic action sets and reward feedback for online speaker assignment (Lin et al., 2023). Actions include assigning the current segment to existing speaker “arms” or creating new entities, and Q-learning updates the decision policy on the fly, with tabular or DQN implementations. This offers adaptability for streaming or teleconference settings with dynamic speaker inventories (a toy sketch follows this list).
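
A toy sketch of the bandit-style online assignment is given below: each incoming segment embedding is either assigned to an existing speaker "arm" or spawns a new one, with an incremental value update playing the role of the tabular Q-update. The epsilon-greedy exploration, reward definition, and centroid update are illustrative choices rather than the formulation of the cited work.

```python
import numpy as np

class OnlineSpeakerAssigner:
    """Toy contextual-bandit-style online diarization.

    Actions: assign the current segment to an existing speaker (arm) or create
    a new speaker. Values are updated from scalar reward feedback (e.g. +1 if
    the assignment was later judged correct), mimicking a tabular Q-update.
    """

    def __init__(self, new_spk_score=0.4, epsilon=0.1, lr=0.1, rng=None):
        self.centroids = []            # one embedding centroid per speaker arm
        self.values = []               # running value estimate per arm
        self.new_spk_score = new_spk_score
        self.epsilon = epsilon
        self.lr = lr
        self.rng = rng or np.random.default_rng()

    def act(self, emb):
        """Return the chosen speaker index for this segment (may create a new arm)."""
        emb = emb / np.linalg.norm(emb)
        # Scores for existing arms combine cosine similarity with learned values
        scores = [float(c @ emb) + v for c, v in zip(self.centroids, self.values)]
        scores.append(self.new_spk_score)                  # score of the "new speaker" action
        if self.rng.random() < self.epsilon:
            choice = int(self.rng.integers(len(scores)))   # explore
        else:
            choice = int(np.argmax(scores))                # exploit
        if choice == len(self.centroids):                  # create a new speaker arm
            self.centroids.append(emb)
            self.values.append(0.0)
        else:                                              # update the chosen centroid
            self.centroids[choice] = 0.9 * self.centroids[choice] + 0.1 * emb
            self.centroids[choice] /= np.linalg.norm(self.centroids[choice])
        return choice

    def feedback(self, speaker_idx, reward):
        """Tabular-style value update from delayed reward feedback."""
        self.values[speaker_idx] += self.lr * (reward - self.values[speaker_idx])
```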

6. Application Domains and Performance

SDR systems are deployed in diverse domains:

  • Meetings and Conferences: Real-time diarization and recognition with 2–10+ overlapping speakers, using distant or headset microphones (Raj et al., 2020, Bando et al., 12 Jun 2024, Cornell et al., 2023).
  • Broadcast and Large-Scale Archives: Scalable two-stage pipelines enable diarization and speaker linking across thousands of hours, dealing with channel, noise, and code-switching variability (Yılmaz et al., 2019).
  • Challenging Far-field/Noisy Environments: Microphone arrays with spatial spectrum estimation or virtual microphone simulation (Pyroomacoustics-based) enable SDR in classrooms, open offices, and far-field setups (Zheng et al., 2021, Gomez, 2022).
  • Evaluation: The best systems (e.g., ensemble-based fusion (Park et al., 2023), LLM-based end-to-end (Yin et al., 8 Aug 2025)) achieve DERs of 3.5–6% on VoxSRC, with cpCER and $\Delta$cp gaps nearly closed on complex multi-speaker test sets.

7. Challenges, Limitations, and Future Directions

  • Overlapping Speech: DER and recognition accuracy degrade with >2 concurrent speakers unless joint separation/diarization architectures are used (Bando et al., 12 Jun 2024, Morrone et al., 9 Sep 2024).
  • Scalability: Large-scale linking ($N \gg 10^4$) incurs $O(N^2)$ time and memory in PLDA scoring and clustering; approximate nearest-neighbor search and hierarchical blocking are promising directions (Yılmaz et al., 2019).
  • Error Propagation in Cascaded Pipelines: Joint or end-to-end systems (e.g., SpeakerLM, SLIDAR) mitigate the “error cascade” by training the ASR and diarization components in a shared framework, which empirically reduces attribution-related WER by several absolute percentage points (Yin et al., 8 Aug 2025, Cornell et al., 2023).
  • Adaptability and Low Resource: RL and open-set recognition frameworks support speaker dynamics and adaptation to domain mismatch without retraining on large, labeled datasets (Lin et al., 2023).
  • Incorporation of Lexical/Semantic Information: Systems utilizing ASR-derived boundaries (word-level turn probabilities) or BERT-based dialogue segmentation have demonstrated clear reductions in DER and speaker-attribution WER, especially in rapid-turn or conversational corpora (Park et al., 2020, Cheng et al., 2023).
  • Generalization: The dominant trend is toward LLM-based, end-to-end architectures with modular registration support, large-scale pretraining, and flexible cross-modal integration. This suggests continued reduction in error propagation, improved generalizability across domains, and increasingly interactive or real-time SDR deployments.

In summary, speaker diarization and recognition research has advanced from modular, clustering-based pipelines to complex, flexible, and increasingly end-to-end frameworks capable of robust speaker-attributed transcription in realistic multi-speaker environments. The best systems integrate spatial, acoustic, lexical, and semantic cues, leverage LLMs and massive data, and adapt dynamically to varying speaker inventories and noise conditions, advancing SDR toward fully general and reliable multi-party conversational analysis.
