
DOVER-Lap: Overlap-Aware Diarization Fusion

Updated 4 December 2025
  • DOVER-Lap is an innovative framework that aggregates multiple overlap-aware speaker diarization systems to accurately label overlapping speech segments.
  • It employs a greedy approximation for global label mapping using a cost tensor to minimize speaker confusion and missed speech errors.
  • Empirical results on benchmarks like AMI, LibriCSS, and DIHARD III demonstrate its effectiveness in reducing DER compared to single system outputs.

DOVER-Lap is an algorithmic framework for combining the outputs of multiple overlap-aware speaker diarization systems. It is designed to aggregate heterogeneous diarization hypotheses—including those produced by clustering-based, neural, or semi-supervised systems—while explicitly accommodating time regions with overlapping speakers. This approach generalizes the DOVER (Diarization Output Voting Error Reduction) method and introduces rigorous voting and label-mapping strategies that address the combinatorial and accuracy challenges posed by overlapping speech scenarios (Raj et al., 2020).

1. Context: Speaker Diarization and Overlap-Aware Modeling

Speaker diarization is defined as the task of segmenting an audio recording into speaker-homogeneous segments, answering the core question of "who spoke when?". Traditional diarization systems typically extract frame-level embeddings such as i-vectors or x-vectors and cluster them under the simplifying assumption that every frame is monophonic (containing only one speaker). However, naturalistic speech—e.g., meetings, dinner parties—features overlapping speech in 10–40% of its regions, violating this single-speaker assumption and resulting in missed speech (MS) errors if overlaps are disregarded.

Recent advances, including overlap detection plus resegmentation, end-to-end neural diarization (EEND), target-speaker voice activity detection (TS-VAD), and region proposal networks (RPN), enable the assignment of multiple speakers per frame. Each model type exhibits different strengths and weaknesses across disjoint subsets of utterances or overlap patterns, motivating the need for robust ensemble combination strategies. Off-the-shelf approaches such as naïve majority voting or DOVER, which restrict themselves to single-label assignments per time region, are inherently incapable of faithfully preserving or modeling overlap (Raj et al., 2020).

2. DOVER: Foundation and Limitations

The DOVER method is an ensemble technique developed to combine $K$ single-speaker diarization hypotheses $H_1,\dots,H_K$ via two main steps:

  • Incremental Global Label Mapping: Speaker labels from each system are aligned incrementally using pairwise bipartite matching (typically the Hungarian algorithm) to maximize the overlap duration between mapped labels. The overlaps $M_{ij}$ between speaker labels $i$ and $j$ across different systems serve as the similarity metric.
  • Region-Level Weighted Voting: The union of all hypothesized segment boundaries from the aligned hypotheses is used to partition time into non-overlapping regions $\{\tau\}$. In each region, DOVER assigns a single speaker label via a weighted majority vote:

$$l_\tau = \arg\max_{n}\sum_{k=1}^K w_k\,\mathbf{1}(l^k_\tau = \hat H^n),$$

where $w_k$ is a confidence weight. Only the most-voted speaker label is assigned to each region.
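The region-level vote can be made concrete with a minimal Python sketch; the function name and input layout are illustrative, not taken from the DOVER implementation:

```python
from collections import defaultdict

def dover_vote(region_labels, weights):
    """Weighted majority vote for one region (single-label DOVER).

    region_labels: one (already globally mapped) speaker label per
                   hypothesis, e.g. ["A", "A", "B"].
    weights:       per-system confidence weights w_k.
    Returns the single label with the highest total weight.
    """
    votes = defaultdict(float)
    for label, w in zip(region_labels, weights):
        votes[label] += w
    return max(votes, key=votes.get)

# Three systems vote on a region; the first system carries the most weight.
print(dover_vote(["A", "A", "B"], [0.5, 0.3, 0.2]))  # "A" wins with weight 0.8
```

Because only the arg-max label survives, any second speaker present in the region is discarded, which is exactly the limitation DOVER-Lap removes.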

DOVER's single-label constraint means that all speech overlap is converted to missed speech, thereby failing to exploit improvements in overlap-aware diarization modules (Raj et al., 2020).

3. DOVER-Lap: Algorithmic Contributions

DOVER-Lap ("DOVER with Overlap", often referred to as "DL") extends DOVER both in the label-mapping and voting stages to enable seamless fusion of overlapping diarization annotations:

3.1 Global Label Mapping with Cost Tensor

  • Let $H_k$ denote the $k$-th hypothesis with $N_k$ local speaker labels.
  • Construct a $K$-dimensional cost tensor $C(i_1,\dots,i_K)$ over all tuples $(i_1,\dots,i_K)$ with $i_k \in \{1,\dots,N_k\}$:

$$C(i_1,\dots,i_K) = -\sum_{1\le a<b\le K} M_{i_a,i_b},$$

where $M_{ij}$ is the normalized temporal overlap ratio between speaker labels $i$ and $j$ in their originating hypotheses.

  • The combinatorial label-mapping objective seeks a covering set of tuples $\mathcal{M}$ such that each original label $H_k^n$ appears exactly once, minimizing the total cost:

$$\min_{\mathcal{M}}\sum_{(i_1,\dots,i_K)\in\mathcal{M}} C(i_1,\dots,i_K).$$

  • This is equivalent to a maximum-weight matching in a $K$-partite hypergraph, which is NP-hard for $K > 2$. DOVER-Lap adopts a greedy approximation: enumerate all $N_1\times\dots\times N_K$ label tuples, sort by $C$, and sequentially select non-conflicting tuples until all labels are covered (Raj et al., 2020).
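The greedy approximation above can be sketched as follows. This is an illustrative Python sketch, not the reference implementation: the pairwise-overlap dictionary layout is an assumption, and for simplicity it assumes equal label counts so that disjoint tuples can cover every label.

```python
from itertools import product

def greedy_label_mapping(M, N):
    """Greedy tuple selection for K-partite label mapping.

    M: dict mapping (hyp_a, label_a, hyp_b, label_b) with hyp_a < hyp_b
       to the pairwise overlap M_{i_a, i_b} (illustrative layout).
    N: list of label counts N_k, one per hypothesis.
    Returns selected tuples, each giving one label index per hypothesis.
    """
    K = len(N)

    def cost(tup):
        # C(i_1,...,i_K) = -(sum of pairwise overlaps within the tuple)
        return -sum(M.get((a, tup[a], b, tup[b]), 0.0)
                    for a in range(K) for b in range(a + 1, K))

    # Enumerate all N_1 x ... x N_K tuples and sort by ascending cost,
    # i.e. highest total overlap first.
    tuples = sorted(product(*(range(n) for n in N)), key=cost)
    used = [set() for _ in range(K)]  # labels already covered per hypothesis
    mapping = []
    for tup in tuples:
        if any(tup[k] in used[k] for k in range(K)):
            continue  # conflicts with an already-selected tuple
        mapping.append(tup)
        for k in range(K):
            used[k].add(tup[k])
        if all(len(used[k]) == N[k] for k in range(K)):
            break  # every label appears exactly once
    return mapping
```

Each selected tuple then receives one global speaker label, which relabels the corresponding local labels across all hypotheses.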

3.2 Overlap-Aware Weighted Voting

  • For every region $T$ defined by the union of segment boundaries, each hypothesis $k$ may assign $n_k^T$ speakers.
  • DOVER-Lap estimates the number of speakers per region using a weighted rounded mean:

$$\hat n_T = \Big\lfloor \sum_{k=1}^K w_k\, n_k^T \Big\rceil,$$

where the $w_k$ are system-specific reliability weights.

  • The $\hat n_T$ speakers with the highest vote counts are assigned to region $T$. Ties are resolved by subdividing the relevant region equally among the tied speakers (Raj et al., 2020).
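A minimal sketch of the per-region overlap-aware vote (illustrative Python; names and input layout are assumptions, and the tie-splitting step is omitted for brevity):

```python
from collections import defaultdict

def doverlap_region_vote(region_speaker_sets, weights):
    """Overlap-aware vote for one region, DOVER-Lap style.

    region_speaker_sets: per hypothesis, the set of (globally mapped)
                         speakers it places in this region.
    weights:             per-system reliability weights w_k (summing to 1).
    Returns the estimated top-n_hat speakers by total vote weight.
    """
    # Estimated speaker count: weighted mean of per-system counts,
    # rounded to the nearest integer.
    n_hat = round(sum(w * len(s) for w, s in zip(weights, region_speaker_sets)))

    votes = defaultdict(float)
    for w, speakers in zip(weights, region_speaker_sets):
        for spk in speakers:
            votes[spk] += w
    ranked = sorted(votes, key=votes.get, reverse=True)
    return set(ranked[:n_hat])

# Two of three systems detect A-B overlap: n_hat = round(1.8) = 2.
hyps = [{"A", "B"}, {"A", "B"}, {"A"}]
print(doverlap_region_vote(hyps, [0.4, 0.4, 0.2]))  # {'A', 'B'} (order may vary)
```

Unlike plain DOVER, the region can now receive multiple labels whenever the weighted speaker-count estimate exceeds one.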

3.3 Pseudocode Outline

| Step | Key Operation | Output/Update |
|------|---------------|---------------|
| Compute overlaps | Pairwise label overlaps | $M_{ij}$ |
| Build cost tensor | $C(i_1,\dots,i_K) = -\sum_{a<b} M_{i_a,i_b}$ | $C$ |
| Select label tuples | Greedy non-conflicting selection | Set of mappings $\mathcal{M}$ |
| Map all labels | Relabel all hypotheses | Globally mapped hypotheses |
| Partition into regions | Union of all boundaries | $\{T\}$ |
| Vote within each region | Weighted voting, pick top $\hat n_T$ | Multi-label fusion |

4. Computational Complexity and Graph Partitioning

Analysis by Raj and Khudanpur (Raj et al., 2021) recasts DOVER-Lap's label mapping as a maximum orthogonal graph partitioning problem. Here, the $K$ hypotheses form the parts of a $K$-partite weighted graph $\mathcal{G}$, where edges are weighted by pairwise overlap durations. The goal is to partition the speaker-label vertices into $C$ (the maximum number of speakers) disjoint cliques, maximizing the total intra-clique edge weight.

  • Complexity: DOVER-Lap's greedy clique enumeration is $O(C^K)$ and is tractable only for modest $K$.
  • Polynomial-time alternatives: A modified incremental DOVER with dynamic anchor merging (pairwise Hungarian, $O(K \cdot C^3)$) retains near-optimality; a $(1-1/C)$ approximation bound is established.
  • Randomized local search: A local search algorithm offers $(1-\epsilon)$-optimality under mild conditions and narrows the DER gap to within tenths of a percent (Raj et al., 2021).
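To make the pairwise incremental alternative concrete, the matching between an anchor and one incoming hypothesis can be sketched with brute-force permutation search standing in for the Hungarian step (illustrative Python; a practical implementation would use an $O(C^3)$ Hungarian solver such as `scipy.optimize.linear_sum_assignment`, yielding the overall $O(K \cdot C^3)$ incremental scheme):

```python
from itertools import permutations

def best_pairwise_assignment(overlap):
    """Maximum-weight bipartite matching between two hypotheses' labels.

    overlap[i][j]: overlap duration between label i of the anchor and
    label j of the incoming hypothesis (assumed square for simplicity).
    Brute force over permutations is used here purely for clarity.
    """
    n = len(overlap)
    best = max(permutations(range(n)),
               key=lambda p: sum(overlap[i][p[i]] for i in range(n)))
    return {i: best[i] for i in range(n)}  # anchor label -> incoming label

# Anchor label 0 mostly co-occurs with incoming label 1, and vice versa.
print(best_pairwise_assignment([[0.1, 0.9], [0.8, 0.2]]))  # {0: 1, 1: 0}
```

Repeating this step hypothesis by hypothesis, while merging the anchor after each match, is what keeps the incremental variant polynomial in $K$.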

5. Empirical Evaluation and Ensemble Impact

DOVER-Lap consistently surpasses both naïve voting and the best single system in settings with real overlap. Key results:

  • AMI Corpus (4 speakers, ~20% overlap):
    • Best single: 21.5% DER.
    • DOVER-Lap with rank weights: 20.5% DER; missed speech reduced by 34–41% relative (Raj et al., 2020).
  • LibriCSS (8 speakers, 0–40% overlap):
    • Best single: 7.4% DER.
    • DOVER-Lap (rank-weighted): 5.4% DER across the full overlap regime (Raj et al., 2020).
  • DIHARD III Challenge:
    • Final fusion: 11.58% DER (Track 1, core 14.09%), better than any single subsystem by 1.5–2% absolute (Horiguchi et al., 2021).
    • System weights were tuned empirically for best outcome; EEND-based systems were given higher weights.

The evidence indicates that DOVER-Lap achieves its improvements primarily by reducing speaker confusion and missed speech in overlap regions, and by averaging out complementary errors across ensemble systems.

6. Extensions for Multichannel and Self-Supervised Learning

DOVER-Lap has proven utility both in late fusion of multi-microphone outputs and as a driver for pseudo-labeling in self-supervised adaptation frameworks:

  • Late fusion of single-channel diarization outputs: For example, combining spectral clustering hypotheses from seven microphones with DOVER-Lap yielded 9.02% DER, outperforming both channel-average (9.40%) and early fusion with beamforming (9.33%) (Raj et al., 2020).
  • Self-supervised adaptation (SSA) in end-to-end neural diarization (EEND-VC): On CHiME-7, per-channel EEND-VC hypotheses were aligned via Hungarian assignment, then fused frame-wise by DOVER-Lap with thresholded majority voting:

$$V_j(t) = \sum_{k=1}^K w_k\, a_{k,j}(t), \qquad g_j(t) = \mathbf{1}\big(V_j(t) \ge \theta\big), \qquad \theta = \frac{1}{2}\sum_{k} w_k.$$

Fused multi-label outputs $\{g_j(t)\}$ serve as session-specific pseudo-ground-truth for retraining, improving channel adaptation, correcting individual system errors, and preserving overlap (Tawara et al., 2023).
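The thresholded frame-wise fusion can be sketched as follows (illustrative Python; the nested-list layout and names are assumptions, not the EEND-VC codebase):

```python
def fuse_frame_activities(activities, weights):
    """Frame-wise thresholded majority vote over aligned hypotheses.

    activities[k][j][t]: binary activity a_{k,j}(t) of speaker j at
                         frame t in (label-aligned) hypothesis k.
    weights:             per-channel/system weights w_k.
    Returns g[j][t] = 1 iff the weighted vote reaches theta = 0.5 * sum(w_k).
    """
    theta = 0.5 * sum(weights)
    K = len(activities)
    J = len(activities[0])        # number of speakers
    T = len(activities[0][0])     # number of frames
    fused = [[0] * T for _ in range(J)]
    for j in range(J):
        for t in range(T):
            v = sum(weights[k] * activities[k][j][t] for k in range(K))
            fused[j][t] = 1 if v >= theta else 0
    return fused

# Two speakers, three frames, two equally weighted channels.
a = [
    [[1, 1, 0], [0, 1, 1]],  # channel 1
    [[1, 0, 0], [0, 1, 0]],  # channel 2
]
print(fuse_frame_activities(a, [1.0, 1.0]))  # [[1, 1, 0], [0, 1, 1]]
```

Note that because each speaker channel is voted independently, overlapping activity survives fusion intact, which is what makes the result usable as pseudo-ground-truth.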

7. Limitations, Practical Considerations, and Future Directions

DOVER-Lap's principal strengths include plug-and-play system-agnostic combination (supporting both single-speaker and overlap-aware systems), explicit modeling of any number of simultaneous speakers, improved confusion handling via global label mapping, and suitability for large-scale or heterogeneous diarization settings.

However, several limitations are noted:

  • Suboptimal Greedy Mapping: The current greedy label mapping is not guaranteed to be optimal; integer-programming or advanced graph-partitioning algorithms may yield further improvements (Raj et al., 2020, Raj et al., 2021).
  • Scalability: The exponential scaling in KK necessitates computationally efficient heuristics or approximations for large ensembles.
  • System Diversity: When combining a mix of single-speaker and overlap-aware systems, voting thresholds may require individualized tuning.
  • Research Directions: Adaptive region definition, confidence-based system weighting, and end-to-end trainable voting schemes have been identified as next steps (Raj et al., 2020).

The framework is implemented in publicly available codebases and has been validated across multiple open-source and industrial benchmark datasets, confirming its role as a robust fusion mechanism for modern overlap-aware diarization systems (Raj et al., 2020, Raj et al., 2021, Horiguchi et al., 2021, Tawara et al., 2023).
