
Overlap-Aware Speaker Diarization

Updated 16 July 2025
  • Overlap-aware speaker diarization is a method that detects and labels speakers during overlapping speech, addressing key limitations of conventional systems.
  • Techniques involve neural architectures, resegmentation, power set encoding, and graph-based clustering to optimize detection and assignment in mixed-speech regions.
  • Empirical results demonstrate significant DER reductions and improved performance in applications such as meeting transcription, broadcast audio, and call center analytics.

Overlap-aware speaker diarization refers to a class of methods and systems that explicitly detect and assign speaker labels during regions of overlapping speech—intervals where two or more speakers are active simultaneously—thereby overcoming one of the principal limitations of conventional diarization approaches. Accurate overlap handling is crucial in conversational domains such as meetings, interviews, and broadcast audio, where speaker co-occurrence is frequent. Developments in overlap-aware diarization have progressed from neural overlap detection and resegmentation methods to end-to-end systems, clustering techniques, ensemble strategies, and graph-theoretic approaches that jointly optimize segmentation and labeling in the presence of overlaps.

1. Neural Architectures for Overlap Detection

Overlap-aware diarization often begins with precise overlapped speech detection (OSD). Early neural architectures treat OSD as a sequence labeling task over an input sequence of acoustic feature vectors $X = \{x_1, x_2, \ldots, x_T\}$ with corresponding labels $y_t = 0$ (non-overlapped) or $y_t = 1$ (overlapped). Bi-directional LSTM networks, frequently stacked in multiple layers and topped with a softmax classification layer, form the backbone of many systems. Architectures may leverage hand-crafted features (e.g., MFCCs with derivatives) or trainable front-ends such as SincNet. Salient training procedures include artificial data augmentation by summing random audio sequences to create overlapped training examples, enhancing model robustness to multi-speaker interactions.
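As a concrete illustration, here is a minimal PyTorch sketch of such a BiLSTM sequence labeller, together with the summing-based augmentation; layer sizes, feature dimensionality, and function names are illustrative assumptions rather than any specific paper's configuration.

```python
import torch
import torch.nn as nn

class OverlapDetector(nn.Module):
    """Minimal BiLSTM sequence labeller for overlapped speech detection (OSD).

    Input:  (batch, T, F) acoustic features, e.g. MFCCs with derivatives.
    Output: (batch, T, 2) per-frame class logits (non-overlap / overlap).
    """
    def __init__(self, n_features=60, hidden=128, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, 2)

    def forward(self, x):
        h, _ = self.lstm(x)          # (batch, T, 2*hidden)
        return self.classifier(h)    # per-frame logits

def make_overlap_example(wav_a, wav_b):
    """Augmentation: sum two random single-speaker signals (1-D NumPy
    arrays or tensors) to synthesize an overlapped training example,
    whose frame labels are all 1 (overlapped)."""
    n = min(len(wav_a), len(wav_b))
    return wav_a[:n] + wav_b[:n]
```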

During inference, overlapped speech labels are produced by sliding windows with averaged predictions and a set threshold $\theta_{\mathrm{OSD}}$. State-of-the-art performance has been demonstrated on AMI, DIHARD, and ETAPE corpora, achieving near 92% precision and significant recall enhancements. The architecture's ability to capture temporal dependencies, combined with balanced augmentation, underpins these results (1910.11646).
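A sketch of that inference step, assuming a `score_fn` that maps a feature window to per-frame overlap probabilities (a hypothetical interface); the window, hop, and threshold values are chosen for illustration:

```python
import numpy as np

def sliding_window_osd(features, score_fn, win=200, hop=50, theta_osd=0.5):
    """Run the detector on overlapping windows, average the per-frame
    overlap probabilities, and threshold at theta_osd.

    features: (T, F) frame-level features; score_fn maps a (w, F) window
    to (w,) overlap probabilities.
    """
    T = features.shape[0]
    acc, cnt = np.zeros(T), np.zeros(T)
    for start in range(0, max(T - win, 0) + 1, hop):
        end = min(start + win, T)
        acc[start:end] += score_fn(features[start:end])
        cnt[start:end] += 1
    avg = acc / np.maximum(cnt, 1)   # averaged predictions per frame
    return avg >= theta_osd          # binary overlap decisions
```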

2. Overlap-Aware Resegmentation and Speaker Assignment

Overlap-aware resegmentation exploits detected overlap regions to perform frame-level two-speaker assignments. Following initial diarization (often using a VB-HMM pipeline with speaker posterior matrix $Q_{st}$), standard “hard” assignments for primary speakers are augmented by assigning a secondary label to frames flagged by OSD. This is based on the next-highest posterior probability in the speaker matrix, yielding a union hypothesis: the primary assignment for all active frames and a secondary assignment for overlapped regions only.
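The core assignment rule is simple enough to state in a few lines of NumPy; this sketch assumes a posterior matrix of shape (frames, speakers) with at least two speaker columns and a boolean OSD mask:

```python
import numpy as np

def overlap_aware_resegmentation(Q, overlap_mask):
    """Assign a primary speaker to every frame (argmax of the posterior
    matrix Q, shape (T, S) with S >= 2) and a secondary speaker (the
    next-highest posterior) only on frames flagged by the OSD.

    Returns two label arrays; secondary is -1 where no overlap was detected.
    """
    order = np.argsort(Q, axis=1)          # ascending posterior order
    primary = order[:, -1]                 # highest posterior per frame
    secondary = np.where(overlap_mask, order[:, -2], -1)
    return primary, secondary
```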

Empirical analysis on AMI shows that this strategy produces a relative DER reduction of 20% (from 29.7% to 23.8%), driven predominantly by a 38% relative reduction in missed detections, though with a minor uptick in false alarms and overlap-related speaker confusion. The method’s greatest efficacy is observed in contexts where two-speaker overlaps comprise the majority (about 75% of overlap cases), and limitations arise when extending to higher-order (≥3 speaker) overlaps (1910.11646).

3. Single-Label Reformulations: Power Set Encoding

To explicitly model speaker dependencies and jointly predict overlapping activity, several recent diarization models adopt a single-label classification formulation with power set encoding (PSE). Given $N$ speakers, each unique combination of speakers (an element of the power set $\mathcal{P}(N)$) is mapped to an integer by $\mathrm{PSE}(S, N) = \sum_{n=1}^{N} \delta(n, S) \cdot 2^{n-1}$, where $S$ is the subset of active speakers and $\delta(n, S)$ is 1 if speaker $n \in S$ and 0 otherwise. Neural architectures predict a softmax over these categories, making overlaps intrinsically represented, eliminating heuristic thresholding, and reducing false positives in overlap regions.
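The encoding itself is a bitmask over speakers; a minimal Python rendering of the formula and its inverse (function names are illustrative):

```python
def pse_encode(active_speakers, n_speakers):
    """Power set encoding: map a set of active speaker indices (1-based,
    as in the formula) to a single class label in [0, 2**n_speakers - 1]."""
    return sum(2 ** (n - 1) for n in active_speakers if 1 <= n <= n_speakers)

def pse_decode(label, n_speakers):
    """Inverse mapping: recover the set of active speakers from a label."""
    return {n for n in range(1, n_speakers + 1) if label & (1 << (n - 1))}

assert pse_encode({1, 3}, 4) == 5          # speakers 1 and 3 active
assert pse_decode(5, 4) == {1, 3}
assert pse_encode(set(), 4) == 0           # silence maps to class 0
```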

The Speaker Embedding-aware Neural Diarization (SEND) framework implements this idea, leveraging both context-independent (CI) and context-dependent (CD) similarity scoring between speech and speaker embeddings, refined further by post-processing networks. When extended with word-level textual information via attention alignment, further reductions in DER are achieved in meeting transcription tasks. Experiments demonstrate up to 34% relative improvement in DER compared to HMM-based clustering baselines, with additional advantages in stability, parameter efficiency, and the accommodation of a flexible speaker count (2111.13694, 2203.09767). Similar PSE formulations underpin models such as SOND and TOLD, which achieve further improvements by integrating sequential modeling, context-aware refinement, and iterative post-processing (2211.10243, 2303.05397).

4. Clustering and Graph-Based Methods for Overlap Disambiguation

Outside of end-to-end neural pipelines, spectral clustering approaches have been adapted to be overlap-aware. One method formulates clustering as a convex optimization, with the assignment matrix $X$ relaxed to a continuous variable $Z$, and eigen-decomposition providing the solution. Discretization is guided by an overlap detector, enabling overlapped segments to belong to two clusters via constrained non-maximal suppression.
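A simplified stand-in for that discretization step (keeping the two top-scoring clusters for overlapped segments, rather than the full constrained non-maximal suppression):

```python
import numpy as np

def discretize_assignments(Z, overlap_mask):
    """Discretize a relaxed assignment matrix Z (segments x clusters,
    with at least two clusters): single-speaker segments take the
    top-scoring cluster; segments flagged as overlapped keep their two
    highest-scoring clusters."""
    top2 = np.argsort(Z, axis=1)[:, -2:]   # two best clusters per segment
    labels = []
    for i, is_ovl in enumerate(overlap_mask):
        labels.append(set(top2[i]) if is_ovl else {top2[i, -1]})
    return labels
```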

Graph-based clustering frameworks advance this by modeling speaker relationships as graphs: nodes represent speech segments, and refined edge weights capture local and global affinities. Community detection algorithms such as Leiden clustering are used to optimize global partitions, while overlap is addressed by allowing each segment to have multiple community labels—the so-called “overlapped community detection” paradigm. Innovations include Graph Convolutional Network (GCN) or Graph Attention Network (GAT) refinement of edge weights and sophisticated label propagation schemes that assign multiple speaker labels when necessary. State-of-the-art results are reported on corpora such as DIHARD-III, with DERs as low as 11.07% under oracle VAD (2306.14530, 2506.02610).
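Below is a deliberately simplified sketch of the overlapped-community idea, using NetworkX's greedy modularity communities as a stand-in for Leiden and a plain affinity threshold `tau` (an assumed parameter) in place of the papers' GNN-refined edge weights and label propagation schemes:

```python
import numpy as np
import networkx as nx

def overlapped_communities(affinity, overlap_mask, tau=0.4):
    """Partition a segment-affinity graph into communities, then let
    OSD-flagged segments carry a second community label when their mean
    affinity to that community exceeds tau. Simplified sketch only."""
    A = np.array(affinity, dtype=float)
    np.fill_diagonal(A, 0.0)                   # drop self-loops
    G = nx.from_numpy_array(A)
    comms = list(nx.community.greedy_modularity_communities(G))
    labels = [set() for _ in range(len(A))]
    for c, members in enumerate(comms):
        for i in members:
            labels[i].add(c)
    for i in np.flatnonzero(overlap_mask):
        for c, members in enumerate(comms):
            others = sorted(members - {i})
            if c not in labels[i] and others and A[i, others].mean() > tau:
                labels[i].add(c)               # second speaker for segment i
    return labels
```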

5. Hybrid, Adaptive, and Ensemble Systems

Overlap-adaptive systems dynamically select between end-to-end overlap-robust segmentation (e.g., WavLM-Conformer architectures) and traditional clustering-based diarization depending on the observed overlap proportion. For instance, in the MISP 2025 Challenge, the system switches to end-to-end segmentation when overlap exceeds 1%, and otherwise relies on VBx clustering, enabling robust and flexible adaptation across varying scenarios (2505.22013).
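The switching logic itself reduces to a threshold test; a minimal sketch with hypothetical pipeline callables standing in for the full systems:

```python
def choose_diarizer(overlap_ratio, e2e_segmenter, vbx_clusterer, threshold=0.01):
    """Overlap-adaptive selection in the spirit of the MISP 2025 system:
    use end-to-end overlap-robust segmentation when the estimated overlap
    proportion exceeds 1%, otherwise fall back to VBx clustering.
    The two callables are hypothetical stand-ins for the real pipelines."""
    return e2e_segmenter if overlap_ratio > threshold else vbx_clusterer
```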

Ensemble strategies such as DOVER-Lap generalize label fusion to operate with overlap-aware hypotheses. Using global k-partite graph matching and rank-weighted voting, DOVER-Lap accommodates multiple active speakers per region, achieving 30–40% relative DER reductions over the best single input system on standard datasets, and proving effective for late fusion in multichannel diarization (2011.01997).
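A per-frame sketch of the rank-weighted voting stage, assuming hypothesis labels have already been mapped to a common namespace (the real system does this with k-partite graph matching, omitted here); the rank-based weight scheme noted in the docstring is one conventional choice, not a claim about the exact configuration:

```python
import numpy as np
from collections import defaultdict

def doverlap_vote(frame_hyps, weights):
    """Rank-weighted label voting for a single frame, in the spirit of
    DOVER-Lap.

    frame_hyps: list of speaker-label sets, one per input hypothesis,
                already mapped to a shared label namespace.
    weights:    rank-based hypothesis weights, e.g. (1 / rank) ** 0.1.
    """
    votes = defaultdict(float)
    for hyp, w in zip(frame_hyps, weights):
        for spk in hyp:
            votes[spk] += w
    # Estimate the number of active speakers as the weighted mean count,
    # then keep that many of the highest-voted speakers.
    n_spk = int(round(np.average([len(h) for h in frame_hyps], weights=weights)))
    ranked = sorted(votes, key=votes.get, reverse=True)
    return set(ranked[:n_spk])
```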

Sparse optimization frameworks offer another paradigm: for example, jointly factorizing the embedding signal into basis and activation matrices with $\ell_1$ regularization. The innate linearity of the embeddings enables the representation and disambiguation of overlapping speech without ad hoc post-processing, producing notable improvements in purity and F-score and rendering the approach language-agnostic and tuning-free (2207.12504).
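As a rough illustration of the idea (not the paper's solver), scikit-learn's dictionary learning can factorize frame embeddings into a speaker basis and $\ell_1$-sparse activations; the activity threshold and penalty weight below are assumed values:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

def sparse_overlap_decomposition(embeddings, n_speakers, alpha=1.0):
    """Factorize frame-level embeddings (T x D) into a speaker basis and
    sparse activations under an l1 penalty; frames with two or more
    simultaneously nonzero activations are read as overlapped speech."""
    dl = DictionaryLearning(n_components=n_speakers, alpha=alpha,
                            transform_algorithm='lasso_lars',
                            positive_code=True, random_state=0)
    activations = dl.fit_transform(embeddings)   # (T, n_speakers), sparse
    active = activations > 1e-3                  # per-frame speaker activity
    overlap = active.sum(axis=1) >= 2            # >= 2 active => overlap
    return activations, overlap
```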

6. Overlap-Aware Diarization in Real-World Settings and Current Challenges

Overlap-aware diarization is now an essential feature in real-world applications: meeting transcription, multilingual conversational analysis, call center analytics, and speaker-attributed ASR. Methods such as utterance-by-utterance overlap-aware diarization with Graph-PIT avoid the segmentation dilemmas of prior EEND-clustering hybrids by coloring overlapping utterances on separate channels and clustering utterance-level embeddings. This approach achieves marked DER reduction and enables coherent interaction with downstream ASR modules (2207.13888).

Comprehensive systems now integrate diarization and ASR in a cascaded architecture, leveraging overlap-aware diarization for speaker-attributed segment extraction and employing ASR-aware weighting/fusion to overcome the limitations of speech separation under low SNR conditions, as evidenced by top performance in recent evaluation campaigns (2505.22013, 2506.05796).

Nevertheless, significant challenges remain. Expanding from two-speaker overlaps to multi-party, highly overlapping meetings demands scalable encoding and clustering paradigms (as the power set size grows exponentially). Furthermore, the sensitivity to annotation boundary mismatch across ASR- and diarization-oriented corpora has been shown to affect both system performance and transferability. Forced alignment preprocessing and morphological closing post-processing help mitigate these issues, but further improvements in standardized boundary conventions and robust transfer learning remain open areas for research (2507.09226).
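For the boundary-mismatch mitigation mentioned above, the morphological closing step can be realized as a one-dimensional binary closing over each speaker's activity track; the gap length below is an assumed tuning parameter:

```python
import numpy as np
from scipy.ndimage import binary_closing

def close_speech_activity(activity, gap_frames=25):
    """Morphological closing over a per-speaker binary activity track:
    fills gaps shorter than roughly gap_frames frames, so diarization
    boundaries better match ASR-oriented annotation conventions."""
    structure = np.ones(gap_frames, dtype=bool)
    return binary_closing(activity, structure=structure)
```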

7. Future Directions and Open Problems

Future work in overlap-aware diarization proceeds along several lines:

  • Generalization to arbitrary (≥3) speaker overlaps, necessitating scalable and efficient encoding.
  • Joint optimization with downstream ASR, leveraging structured diarization outputs or triplet-based conditioning of LLM decoders for precise, speaker-attributed transcriptions in multi-speaker, multi-lingual settings (2506.05796).
  • Robustness to label boundary imprecision and dataset-specific segmentations, motivating further attention to data standardization (2507.09226) and simulation-agnostic pretraining strategies (2505.24545).
  • Improved overlap detectors (e.g., using spatial information in multi-channel setups or integrating cross-modal cues) and the further unification of detection and assignment stages into learnable, end-to-end frameworks.
  • Algorithmic enhancements to ensemble combination (e.g., more accurate k-partite matching for DOVER-Lap), and the extension of ensemble and graph-based methods to online and real-time deployments.

The field continues to integrate overlap-awareness as a fundamental property, reshaping both evaluation standards and state-of-the-art diarization systems across research and industrial domains.