DOVER-Lap: Overlap-Aware Diarization Fusion
- DOVER-Lap is a framework that aggregates the outputs of multiple overlap-aware speaker diarization systems while retaining speaker labels in overlapping speech segments.
- It employs a greedy approximation for global label mapping using a cost tensor to minimize speaker confusion and missed speech errors.
- Empirical results on benchmarks like AMI, LibriCSS, and DIHARD III demonstrate its effectiveness in reducing DER compared to single system outputs.
DOVER-Lap is an algorithmic framework for combining the outputs of multiple overlap-aware speaker diarization systems. It is designed to aggregate heterogeneous diarization hypotheses—including those produced by clustering-based, neural, or semi-supervised systems—while explicitly accommodating time regions with overlapping speakers. This approach generalizes the DOVER (Diarization Output Voting Error Reduction) method and introduces rigorous voting and label-mapping strategies that address the combinatorial and accuracy challenges posed by overlapping speech scenarios (Raj et al., 2020).
1. Context: Speaker Diarization and Overlap-Aware Modeling
Speaker diarization is defined as the task of segmenting an audio recording into speaker-homogeneous segments, answering the core question of "who spoke when?". Traditional diarization systems typically extract frame-level embeddings such as i-vectors or x-vectors and cluster them under the simplifying assumption that every frame is monophonic (containing only one speaker). However, naturalistic speech—e.g., meetings, dinner parties—features overlapping speech in 10–40% of its regions, violating this single-speaker assumption and resulting in missed speech (MS) errors if overlaps are disregarded.
Recent advances, including overlap detection plus resegmentation, end-to-end neural diarization (EEND), target-speaker voice activity detection (TS-VAD), and region proposal networks (RPN), enable the assignment of multiple speakers per frame. Each model type exhibits different strengths and weaknesses across disjoint subsets of utterances or overlap patterns, motivating the need for robust ensemble combination strategies. Off-the-shelf approaches such as naïve majority voting or DOVER, which restrict themselves to single-label assignments per time region, are inherently incapable of faithfully preserving or modeling overlap (Raj et al., 2020).
2. DOVER: Foundation and Limitations
The DOVER method is an ensemble technique developed to combine single-speaker diarization hypotheses via two main steps:
- Incremental Global Label Mapping: Speaker labels from each system are aligned incrementally using pairwise bipartite matching (typically the Hungarian algorithm) to maximize overlap duration between mapped labels. Overlaps between speaker labels and across different systems serve as the similarity metric.
- Region-Level Weighted Voting: The union of all hypothesized segment boundaries from the aligned hypotheses is used to partition time into non-overlapping regions $r$. In each region, DOVER assigns a single speaker label via a weighted majority vote:

$$\hat{s}(r) = \arg\max_{s} \sum_{k=1}^{K} w_k \, \mathbb{1}\big[H_k(r) = s\big],$$

where $H_k(r)$ is the label assigned by the $k$-th hypothesis in region $r$ and $w_k$ is a confidence weight. Only the most-voted speaker label is assigned to each region.
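The first step, pairwise label alignment, can be sketched as follows. This is a minimal illustration with an assumed segment-list representation; it brute-forces over permutations for clarity, whereas practical implementations use the Hungarian algorithm (e.g. `scipy.optimize.linear_sum_assignment`):

```python
from itertools import permutations

def total_overlap(segs_a, segs_b):
    """Total duration where two lists of (start, end) segments intersect."""
    return sum(max(0.0, min(ea, eb) - max(sa, sb))
               for sa, ea in segs_a for sb, eb in segs_b)

def map_labels(hyp_a, hyp_b):
    """Relabel the speakers of hyp_b with the labels of hyp_a so that total
    temporal overlap is maximized. Brute force over permutations for
    clarity; real systems solve this with the Hungarian algorithm."""
    labels_a, labels_b = list(hyp_a), list(hyp_b)
    assert len(labels_a) == len(labels_b)  # equal speaker counts, for simplicity
    best, best_score = None, -1.0
    for perm in permutations(labels_a):
        score = sum(total_overlap(hyp_a[a], hyp_b[b])
                    for a, b in zip(perm, labels_b))
        if score > best_score:
            best, best_score = dict(zip(labels_b, perm)), score
    return best
```

For example, mapping a two-speaker hypothesis onto another aligns each local label to the reference label it overlaps most.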
DOVER's single-label constraint means that all speech overlap is converted to missed speech, thereby failing to exploit improvements in overlap-aware diarization modules (Raj et al., 2020).
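The second step, the region-level vote, can be sketched for a single region as follows (a minimal illustration assuming the hypotheses already share a global label space):

```python
from collections import defaultdict

def dover_vote(region_labels, weights):
    """DOVER-style single-label vote for one region.

    region_labels: one speaker label per hypothesis (after global mapping).
    weights: confidence weight w_k per hypothesis.
    Returns the single label with the largest total weight, so any genuine
    overlap in this region collapses to one speaker (missed speech).
    """
    votes = defaultdict(float)
    for label, w in zip(region_labels, weights):
        votes[label] += w
    return max(votes, key=votes.get)
```

Because exactly one label survives per region, the single-label constraint described above is visible directly in the return value.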
3. DOVER-Lap: Algorithmic Contributions
DOVER-Lap ("DOVER with Overlap", often referred to as "DL") extends DOVER both in the label-mapping and voting stages to enable seamless fusion of overlapping diarization annotations:
3.1 Global Label Mapping with Cost Tensor
- Let $H_k$, $k = 1, \dots, K$, denote the $k$-th hypothesis with $N_k$ local speaker labels.
- Construct a $K$-dimensional cost tensor $C$ for all tuples $(i_1, \dots, i_K)$ with $i_k \in \{1, \dots, N_k\}$:

$$C(i_1, \dots, i_K) = -\sum_{k < l} \delta\big(s^{k}_{i_k}, s^{l}_{i_l}\big),$$

where $\delta(\cdot, \cdot)$ is the normalized temporal overlap ratio between speaker labels $s^{k}_{i_k}$ and $s^{l}_{i_l}$ in their originating hypotheses.
- The combinatorial label mapping objective seeks a covering set $\mathcal{T}$ of tuples such that each original label appears exactly once, minimizing total cost:

$$\min_{\mathcal{T}} \sum_{(i_1, \dots, i_K) \in \mathcal{T}} C(i_1, \dots, i_K).$$

- This is equivalent to a maximum-weight matching in a $K$-partite hypergraph, which is NP-hard for $K \geq 3$. DOVER-Lap adopts a greedy approximation: enumerate all label tuples, sort by cost $C$, and sequentially select non-conflicting tuples until all labels are covered (Raj et al., 2020).
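The greedy tuple selection can be sketched as follows, assuming a simple segment-list representation per hypothesis and scoring each tuple by its total pairwise overlap (the data layout and helper names are illustrative, not from the paper):

```python
from itertools import product

def greedy_label_mapping(hyps):
    """Greedy approximation to DOVER-Lap's K-dimensional label mapping.

    hyps: list of dicts, each mapping a local speaker label to a list of
    (start, end) segments. Returns, per hypothesis, a map from local label
    to a shared global index. Simplified sketch assuming every hypothesis
    has the same number of speakers.
    """
    def overlap(segs_a, segs_b):
        return sum(max(0.0, min(ea, eb) - max(sa, sb))
                   for sa, ea in segs_a for sb, eb in segs_b)

    label_sets = [list(h) for h in hyps]
    # Score every K-tuple by total pairwise overlap (the negated cost).
    scored = []
    for combo in product(*label_sets):
        score = sum(overlap(hyps[k][combo[k]], hyps[l][combo[l]])
                    for k in range(len(hyps))
                    for l in range(k + 1, len(hyps)))
        scored.append((score, combo))
    scored.sort(key=lambda t: -t[0])  # lowest-cost tuples first

    used = [set() for _ in hyps]
    mapping = [dict() for _ in hyps]
    next_global = 0
    for _, combo in scored:
        if any(combo[k] in used[k] for k in range(len(hyps))):
            continue  # conflicts with an already-selected tuple
        for k, label in enumerate(combo):
            used[k].add(label)
            mapping[k][label] = next_global
        next_global += 1
    return mapping
```

Note that enumerating all tuples is exponential in the number of hypotheses, which is exactly the scalability concern discussed later.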
3.2 Overlap-Aware Weighted Voting
- For every region $r$ defined by the union of segment boundaries, each hypothesis $H_k$ may assign $n_k(r)$ speakers.
- DOVER-Lap estimates the number of speakers $\hat{n}(r)$ per region using a weighted rounded mean:

$$\hat{n}(r) = \Big\lfloor \sum_{k=1}^{K} w_k \, n_k(r) \Big\rceil,$$

where $w_k$ are system-specific reliability weights normalized to sum to one, and $\lfloor \cdot \rceil$ denotes rounding to the nearest integer.
- The $\hat{n}(r)$ speakers with the highest vote counts are assigned to region $r$. Ties are resolved by subdividing the relevant region equally among tied speakers (Raj et al., 2020).
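The two voting steps can be sketched for a single region as follows; this simplification omits the tie-breaking subdivision and assumes the weights are already normalized:

```python
from collections import defaultdict

def doverlap_vote(speaker_sets, weights):
    """Overlap-aware vote for one region (DOVER-Lap style sketch).

    speaker_sets: for each hypothesis, the set of globally mapped speakers
    it assigns in this region. weights: per-hypothesis reliability weights,
    assumed to sum to 1. Tie-breaking by region subdivision is omitted.
    """
    # 1. Estimate the number of speakers as a weighted rounded mean.
    n_hat = round(sum(w * len(s) for w, s in zip(weights, speaker_sets)))
    # 2. Rank speakers by total weighted vote and keep the top n_hat.
    votes = defaultdict(float)
    for w, spk_set in zip(weights, speaker_sets):
        for spk in spk_set:
            votes[spk] += w
    ranked = sorted(votes, key=votes.get, reverse=True)
    return set(ranked[:n_hat])
```

Unlike the DOVER vote, this returns a set: when the weighted mean rounds to two or more, the region keeps multiple simultaneous speakers.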
3.3 Pseudocode Outline
| Step | Key Operation | Output/Update |
|---|---|---|
| Compute overlaps | Pairwise label overlaps $\delta(s^k_i, s^l_j)$ | Pairwise overlap scores |
| Build cost tensor | Negate and sum pairwise overlaps per tuple | Cost tensor $C$ |
| Select label tuples via greedy matching | Iterative non-conflicting selection | Set of mappings |
| Map all labels | Relabel all hypotheses | Globally mapped hypotheses |
| Partition into regions | Union of all boundaries | Non-overlapping regions $r$ |
| Vote within each region | Weighted voting, pick top $\hat{n}(r)$ | Multi-label fusion |
4. Computational Complexity and Graph Partitioning
Analysis by Raj and Khudanpur (Raj et al., 2021) recasts DOVER-Lap's label mapping as a maximum orthogonal graph partitioning problem. Here, the $K$ hypotheses form the parts of a $K$-partite weighted graph $G$, with edges weighted by the pairwise temporal overlap between speaker labels. The goal is to partition the speaker-label vertices into $N$ disjoint cliques, where $N$ is the maximum number of speakers in any hypothesis, maximizing total intra-clique edge weight.
- Complexity: DOVER-Lap's greedy clique enumeration scales exponentially in the number of hypotheses $K$ (it enumerates all $\prod_k N_k$ label tuples) and is tractable only for modest $K$.
- Polynomial-time alternatives: A modified incremental DOVER with dynamic anchor merging (pairwise Hungarian matching) retains near-optimality, and an approximation bound is established for it.
- Randomized local search: A local search algorithm offers provable near-optimality under mild conditions and narrows the DER gap to within tenths of a percent (Raj et al., 2021).
5. Empirical Evaluation and Ensemble Impact
DOVER-Lap consistently surpasses both naïve voting and the best single system in settings with real overlap. Key results:
- AMI Corpus (4 speakers, ~20% overlap):
- Best single: 21.5% DER.
- DOVER-Lap with rank weights: 20.5% DER; missed speech reduced by 34–41% relative (Raj et al., 2020).
- LibriCSS (8 speakers, 0–40% overlap):
- Best single: 7.4% DER.
- DOVER-Lap (rank-weighted): 5.4% DER across the full overlap regime (Raj et al., 2020).
- DIHARD III Challenge:
- Final fusion: 11.58% DER (Track 1, core 14.09%), better than any single subsystem by 1.5–2% absolute (Horiguchi et al., 2021).
- System weights were tuned empirically for best outcome; EEND-based systems were given higher weights.
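One simple rank-based weighting scheme can be sketched as follows; the $\mathrm{rank}^{-\alpha}$ form and the exponent value are illustrative assumptions, not the exact scheme used in the cited papers:

```python
def rank_weights(ders, alpha=0.1):
    """Illustrative rank-based system weighting (assumed form, not the
    exact scheme from the papers): systems are ranked by DER, and each
    receives weight proportional to rank**(-alpha), normalized to sum to 1,
    so better-performing systems get slightly larger weights."""
    order = sorted(range(len(ders)), key=lambda i: ders[i])
    raw = [0.0] * len(ders)
    for rank, i in enumerate(order, start=1):
        raw[i] = rank ** -alpha
    total = sum(raw)
    return [w / total for w in raw]
```

A small exponent keeps the weights close to uniform, which matches the intuition that all ensemble members should still contribute to the vote.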
The evidence indicates that DOVER-Lap achieves its improvements primarily by reducing speaker confusion and missed speech in overlap regions, and by averaging out complementary errors across ensemble systems.
6. Extensions for Multichannel and Self-Supervised Learning
DOVER-Lap has proven utility both in late fusion of multi-microphone outputs and as a driver for pseudo-labeling in self-supervised adaptation frameworks:
- Late fusion of single-channel diarization outputs: For example, combining spectral clustering hypotheses from seven microphones with DOVER-Lap yielded 9.02% DER, outperforming both channel-average (9.40%) and early fusion with beamforming (9.33%) (Raj et al., 2020).
- Self-supervised adaptation (SSA) in end-to-end neural diarization (EEND-VC): On CHiME-7, per-channel EEND-VC hypotheses were aligned via Hungarian assignment, then fused frame-wise by DOVER-Lap with thresholded majority voting. The fused multi-label outputs serve as session-specific pseudo-ground-truth for retraining, improving channel adaptation, correcting individual system errors, and preserving overlap (Tawara et al., 2023).
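The frame-wise thresholded voting described above might be sketched as follows; the activity-matrix representation and the 0.5 threshold are assumptions for illustration, not details taken from Tawara et al.:

```python
def fuse_frames(channel_activities, threshold=0.5):
    """Frame-wise thresholded majority voting over per-channel
    speaker-activity matrices (sketch; the threshold value is an assumption).

    channel_activities: list of matrices, each frames x speakers with 0/1
    entries, already aligned to a shared speaker order. Returns a
    multi-label 0/1 matrix: a speaker is active in a frame if the fraction
    of channels marking it active exceeds `threshold`.
    """
    n_ch = len(channel_activities)
    n_frames = len(channel_activities[0])
    n_spk = len(channel_activities[0][0])
    fused = [[0] * n_spk for _ in range(n_frames)]
    for t in range(n_frames):
        for s in range(n_spk):
            votes = sum(ch[t][s] for ch in channel_activities)
            fused[t][s] = 1 if votes / n_ch > threshold else 0
    return fused
```

Because each frame's output is a vector of 0/1 activities rather than a single label, overlap survives into the pseudo-labels used for retraining.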
7. Limitations, Practical Considerations, and Future Directions
DOVER-Lap's principal strengths include plug-and-play system-agnostic combination (supporting both single-speaker and overlap-aware systems), explicit modeling of any number of simultaneous speakers, improved confusion handling via global label mapping, and suitability for large-scale or heterogeneous diarization settings.
However, several limitations are noted:
- Suboptimal Greedy Mapping: The current greedy label mapping is not guaranteed to be optimal; integer-programming or advanced graph-partitioning algorithms may yield further improvements (Raj et al., 2020, Raj et al., 2021).
- Scalability: The exponential scaling in the number of hypotheses $K$ necessitates computationally efficient heuristics or approximations for large ensembles.
- System Diversity: When combining a mix of single-speaker and overlap-aware systems, voting thresholds may require individualized tuning.
- Research Directions: Adaptive region definition, confidence-based system weighting, and end-to-end trainable voting schemes have been identified as next steps (Raj et al., 2020).
The framework is implemented in publicly available codebases and has been validated across multiple open-source and industrial benchmark datasets, confirming its role as a robust fusion mechanism for modern overlap-aware diarization systems (Raj et al., 2020, Raj et al., 2021, Horiguchi et al., 2021, Tawara et al., 2023).