
Probability-Level Fusion in EEND Systems

Updated 4 December 2025
  • The paper presents a probability-level fusion approach that aggregates frame-wise soft posteriors from multiple EEND systems to significantly reduce diarization error rates.
  • It leverages methodologies such as generalized mean fusion, ensemble calibration with Platt scaling, and risk-profiling to enhance model confidence and performance.
  • Empirical results on CallHome data indicate that powerset joint calibration yields up to 19% relative DER reduction, and that a fuse-then-calibrate pipeline with dynamic logits outperforms traditional segment-level methods such as DOVER-Lap.

Probability-level fusion of End-to-End Neural Diarization (EEND) systems refers to the principled combination of multiple EEND model outputs at the level of frame-wise soft posteriors, rather than the aggregation of hard segment-level decisions. This approach leverages the probabilistic confidence scores produced by neural diarization models, enabling fusion and calibration schemes that benefit from the diversity and complementary strengths of different system architectures, input features, and training regimes. Probability-level fusion frameworks accommodate calibration techniques such as Platt scaling and risk-profiling with generalized scoring rules, yielding improved Diarization Error Rate (DER) and well-calibrated confidence estimates for downstream applications (Alvarez-Trejos et al., 27 Nov 2025; Nelson et al., 2011).

1. Formal Problem Setting and Output Representations

Probability-level fusion aims to combine $M$ EEND systems, each producing, at each frame $t=1,\ldots,T$, either:

  • A multilabel posterior $\mathbf{p}_m^{(\text{Mult})}(t) = [p_{m,1}(t),\ldots,p_{m,S}(t)] \in [0,1]^S$, where $S$ is the number of speakers and $p_{m,s}(t)$ is the probability that speaker $s$ is active in frame $t$ under model $m$.
  • A powerset posterior $\mathbf{p}_m^{(\text{Power})}(t) = [p_{m,c_1}(t),\ldots,p_{m,c_K}(t)] \in [0,1]^K$, with $K=2^S$ and each $c_k\subseteq\{1,\ldots,S\}$ denoting a unique subset of active speakers.

Fusion seeks an operator $g(\cdot;\theta)$, possibly parameterized by $\theta$, that aggregates $\{\mathbf{p}_m^{(\cdot)}(t)\}_{m=1}^M$ so as to minimize a diarization loss (e.g., frame-wise cross-entropy or DER) on held-out development data (Alvarez-Trejos et al., 27 Nov 2025). The choice of output representation determines the subsequent calibration and fusion steps: the powerset formulation models inter-speaker dependencies explicitly, while the multilabel formulation treats speakers independently.
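
To make the shapes concrete, the following sketch sets up both representations and a trivial fusion operator in NumPy. The array layout, names, and the arithmetic-mean placeholder are illustrative assumptions, not an implementation from the cited papers.

```python
import numpy as np

M, T, S = 3, 1000, 4            # systems, frames, speakers (assumed sizes)
K = 2 ** S                      # number of powerset classes (speaker subsets)

rng = np.random.default_rng(0)

# Multilabel posteriors: p_mult[m, t, s] = P(speaker s active at frame t | model m)
p_mult = rng.random((M, T, S))

# Powerset posteriors: softmax over the 2^S subsets, normalized per frame
z = rng.standard_normal((M, T, K))
p_power = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

def fuse(posteriors: np.ndarray, theta=None) -> np.ndarray:
    """Generic fusion operator g(.; theta): (M, T, C) -> (T, C).

    The unweighted arithmetic mean is the simplest instance; the schemes
    in Sections 2-3 replace it with generalized means, logit averaging,
    or learned weights.
    """
    return posteriors.mean(axis=0)

p_fused = fuse(p_power)         # shape (T, K)
```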

2. Probability-Level Fusion Frameworks

Fusion at the probability level exploits both classical and modern statistical frameworks. Two major paradigms are in use:

A. Generalized Mean (α-β) Fusion

The two-parameter fusion method (Nelson et al., 2011) unifies smoothing and correlation-adjusted combination:

$$M_\alpha(t,c) = \left(\frac{1}{N}\sum_{i=1}^{N} p_i(t,c)^\alpha\right)^{1/\alpha}$$

where $N$ is the number of streams and $\alpha$ controls smoothing: $\alpha>0$ biases the mean toward the maxima, $\alpha<0$ toward the minima, and $\alpha\to0$ yields the geometric mean. An additional correlation exponent $\beta\in[0,1]$ interprets $N^\beta$ as the effective number of independent samples. The fused unnormalized score is:

$$S(t,c) = \left[M_\alpha(t,c)\right]^{N^\beta} = \left(\frac{1}{N}\sum_{i=1}^N p_i(t,c)^\alpha\right)^{N^\beta/\alpha}$$

which is normalized to yield the fused probabilities:

$$p_\text{fused}(t,c) = \frac{S(t,c)}{\sum_{c'} S(t,c')}$$
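
A direct NumPy transcription of these equations might look as follows. This is a minimal sketch, assuming all $N$ streams share the same class space; the epsilon floor and the threshold for the geometric-mean branch are numerical-safety assumptions.

```python
import numpy as np

def alpha_beta_fuse(p: np.ndarray, alpha: float, beta: float) -> np.ndarray:
    """Two-parameter (alpha-beta) fusion of N probability streams.

    p: posteriors of shape (N, T, C); alpha: smoothing exponent;
    beta: correlation exponent, with N**beta acting as the effective
    number of independent streams. Returns fused posteriors (T, C).
    """
    N = p.shape[0]
    eps = 1e-12                                   # numerical-safety floor
    if abs(alpha) < 1e-8:                         # alpha -> 0: geometric mean
        m = np.exp(np.log(p + eps).mean(axis=0))
    else:
        m = np.power(p + eps, alpha).mean(axis=0) ** (1.0 / alpha)
    s = m ** (N ** beta)                          # unnormalized fused score
    return s / s.sum(axis=-1, keepdims=True)      # normalize over classes
```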

B. Ensemble Fusion and Calibration

Recent approaches (Alvarez-Trejos et al., 27 Nov 2025) consider multiple unsupervised and supervised fusion techniques, including:

  • Average probabilities: $\mathbf{p}_\text{fused}(t) = \frac{1}{M}\sum_m \mathbf{p}_m(t)$
  • Average logits: average the per-model logits, then apply the activation (sigmoid or softmax)
  • Dynamic logits: weight each model's logits by its average logit scale before the final activation
  • Entropy-based weighting of the per-model posteriors
  • Supervised metalearning: learn a weighted linear classifier over the concatenated logits to minimize cross-entropy

All schemes are compatible with both multilabel and powerset representations; short sketches of the unsupervised variants follow.
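
These sketches assume multilabel posteriors of shape (M, T, S). The dynamic-logits weighting rule used here (mean absolute logit per model, normalized to sum to one) is an assumption for illustration, since the scheme is only described qualitatively above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p, eps=1e-7):
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p) - np.log1p(-p)

def average_probabilities(p):   # p: (M, T, S) multilabel posteriors
    return p.mean(axis=0)

def average_logits(p):          # average in logit space, then squash
    return sigmoid(logit(p).mean(axis=0))

def dynamic_logits(p):          # weight each model by its logit scale
    z = logit(p)                                  # (M, T, S)
    scale = np.abs(z).mean(axis=(1, 2))           # assumed per-model scale
    w = scale / scale.sum()
    return sigmoid(np.tensordot(w, z, axes=1))    # weighted logit average
```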

3. Calibration and Risk-Profiling

Accurate fusion must address miscalibration of the underlying systems. Platt scaling, a logistic-regression-based post-hoc calibration, is employed in two variants:

  • Independent multilabel calibration: each speaker's output is calibrated separately as $p_i^{\text{cal}}(t) = \sigma(\alpha_i \log p_i(t) + \beta_i)$ (a minimal fitting sketch follows this list).
  • Joint calibration: All speakers (or powerset components) are calibrated via a joint affine transformation followed by logistic or softmax activation (Alvarez-Trejos et al., 27 Nov 2025).
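
The sketch below fits the independent multilabel variant, one logistic regression per speaker, on held-out development data; the scikit-learn usage and the clipping constant are implementation assumptions. The joint variant would instead fit a single multinomial logistic regression over all concatenated log-posteriors.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt_per_speaker(p_dev, y_dev):
    """Fit sigma(alpha_i * log p_i + beta_i) independently per speaker.

    p_dev: (T, S) uncalibrated dev posteriors; y_dev: (T, S) binary labels.
    Returns a list of S fitted calibrators.
    """
    calibrators = []
    for s in range(p_dev.shape[1]):
        x = np.log(np.clip(p_dev[:, s], 1e-7, 1.0)).reshape(-1, 1)
        calibrators.append(LogisticRegression().fit(x, y_dev[:, s]))
    return calibrators

def apply_platt(calibrators, p):
    """Apply per-speaker calibrators to test posteriors p of shape (T, S)."""
    cols = [c.predict_proba(
                np.log(np.clip(p[:, s], 1e-7, 1.0)).reshape(-1, 1))[:, 1]
            for s, c in enumerate(calibrators)]
    return np.stack(cols, axis=1)
```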

Additionally, risk profiling via Tsallis coupled-surprisal (Nelson et al., 2011) employs the deformed logarithm:

$$\ln_k(x) = \frac{x^k - 1}{k},\qquad S_k(p) = -\ln_k(p) = \frac{1 - p^k}{k}$$

The average coupled-surprisal over $T$ frames with true labels $c_\text{true}(t)$ is

$$\bar S_k = \frac{1}{T}\sum_{t=1}^T S_k\!\left(p_\text{fused}(t, c_\text{true}(t))\right)$$

An effective probability is then defined via the inverse deformed exponential:

$$p_\text{eff} = \left[1 - k\,\bar S_k\right]^{1/k} = \left(\frac{1}{T}\sum_{t=1}^T p_\text{fused}(t,c_\text{true}(t))^k\right)^{1/k}$$

By varying $k$, one can evaluate decisiveness ($k>0$), neutrality ($k=0$), or robustness ($k<0$) in the fused stream.
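
As the last equation shows, $p_\text{eff}$ is simply a power mean of the per-frame true-class posteriors. A minimal sketch, with the $k\to0$ branch handled by the geometric-mean limit:

```python
import numpy as np

def effective_probability(p_true: np.ndarray, k: float) -> float:
    """Risk-profiled effective probability of the fused stream.

    p_true: (T,) array of p_fused(t, c_true(t)); k > 0 stresses
    decisiveness, k < 0 robustness, k -> 0 neutrality (log-loss).
    """
    p = np.clip(p_true, 1e-12, 1.0)
    if abs(k) < 1e-8:
        return float(np.exp(np.log(p).mean()))    # geometric-mean limit
    return float(np.power(p, k).mean() ** (1.0 / k))
```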

4. Comparative Empirical Results

Extensive benchmarking on the CallHome dataset with three EEND-EDA systems (distinct input features), both with and without fine-tuning, shows:

  • Powerset joint calibration yields up to 19% relative DER reduction for non-fine-tuned models; calibration can partially compensate for missing domain adaptation, with calibrated non-fine-tuned MFB reaching 8.397% DER versus 8.236% for the uncalibrated fine-tuned model (Alvarez-Trejos et al., 27 Nov 2025).
  • Joint calibration outperforms independent calibration, especially in multilabel space (e.g., up to 30% relative DER drop for ECAPA).
  • Fusion via dynamic logits, particularly in a "fuse-then-calibrate" (F→C) pipeline, surpasses segment-level voting schemes such as DOVER-Lap in terms of DER (e.g., Dynamic Logits F→C + FT achieves DER 6.543% vs 6.910% for DOVER-Lap).
  • Calibration in the powerset space offers substantial gains for individual models (MFB DER 10.874% → 8.397%), though multilabel calibration may harm single-system performance.
  • Error analysis indicates calibration reduces false alarms more than it increases misses. Fusion reduces speaker confusion, harnessing complementary system strengths.

A summary of best configurations and key metrics:

Method                DER % (No FT)   DER % (FT)   BCE (No FT)   BCE (FT)
Dynamic Logits F→C    7.458           6.543        0.239         0.217
DOVER-Lap baseline    7.940           6.910        n/a           n/a

(BCE is not reported for the hard-decision DOVER-Lap baseline.)

5. Guidelines and Practical Recommendations

From systematic analysis, best practices for probability-level EEND system fusion include (Alvarez-Trejos et al., 27 Nov 2025; Nelson et al., 2011):

  1. Prefer powerset output representations with joint calibration, to exploit inter-speaker dependencies and optimize calibration.
  2. Adopt "fuse-then-calibrate" processing order (F→C), requiring only a single calibration model and yielding superior DER and computational efficiency.
  3. Use dynamic logits fusion for most robust gains; this scheme weights models by logit scale, then applies a final nonlinearity.
  4. Always calibrate in the powerset space, even if subsequent fusion or evaluation is performed in the multilabel space.
  5. Evaluate calibration quality via proper scoring rules (e.g., binary cross-entropy) as well as DER to prevent misaligned improvements.
  6. Use a held-out calibration set drawn from the target domain; calibration may substitute for fine-tuning in low-data regimes.
  7. Apply median filtering to fused probabilities and maintain a consistent decision threshold (commonly 0.5), optionally using decision-theoretic thresholding on calibrated probabilities; a minimal post-processing sketch follows this list.
  8. Probability-level soft fusion and calibration obviate the need for hard segment-level schemes such as DOVER-Lap, enabling more flexible downstream applications.
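
For recommendation 7, a minimal post-processing sketch, assuming multilabel posteriors of shape (T, S); the filter length is an assumption to be tuned on development data.

```python
import numpy as np
from scipy.ndimage import median_filter

def posteriors_to_decisions(p_fused: np.ndarray,
                            kernel: int = 11,
                            threshold: float = 0.5) -> np.ndarray:
    """Median-filter fused posteriors along time, then threshold.

    p_fused: (T, S) calibrated fused posteriors; kernel should be odd.
    Returns a (T, S) boolean speaker-activity mask.
    """
    smoothed = median_filter(p_fused, size=(kernel, 1))  # per-speaker, in time
    return smoothed > threshold
```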

6. Theoretical Underpinnings: Risk, Correlation, and Scoring Rules

The α–β fusion framework and Tsallis coupled-surprisal provide an interpretive bridge between fusion parameterization and algorithm risk bias (Nelson et al., 2011):

  • α controls smoothing/sharpening of fused posteriors; typical EEND fusion benefits from α ∈ [0.2, 0.6].
  • β parameterizes correlation between systems, with β ≈ 0.5–0.8 reflecting partial dependency due to shared architectures or features.
  • Risk profiling with parameter $k$ enables evaluation of accuracy/robustness trade-offs: $k>0$ for decisiveness, $k<0$ for robustness.
  • Minimizing the average coupled-surprisal $\bar S_k$ is strictly proper for all $k$, yielding a scoring-rule family continuous with log-loss at $k=0$ (worked out after this list).
  • The effective probability $p_\text{eff}$ quantifies the ensemble's true-class confidence under the selected risk profile.
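
As a quick check of the continuity claim, expand $p^k = e^{k\ln p} = 1 + k\ln p + O(k^2)$; then

$$\lim_{k\to 0} S_k(p) = \lim_{k\to 0}\frac{1 - p^k}{k} = -\ln p,$$

so the coupled-surprisal family recovers the standard surprisal, and its average recovers log-loss, at $k=0$.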

7. Significance, Limitations, and Implications

Probability-level fusion of EEND systems overcomes the limitations of segment-level majority-voting methods such as DOVER-Lap by fully leveraging soft confidences and enabling model calibration. The integration of joint powerset calibration, dynamic-logit fusion, and risk-profiling with generalized scoring rules results in superior diarization performance and better-calibrated confidence estimates, which are essential for applications that depend on probabilistic outputs.

A plausible implication is that the combination of these methods constitutes a new baseline for EEND system ensembles, since gains are observed even in the absence of fine-tuning. The adoption of calibration as an essential component, rather than a post-hoc adjustment, marks a methodological advance in speaker diarization. However, full independence of base models is rarely achieved in practice, so proper modeling of inter-system correlation via the β exponent or analogous strategies remains critical.

Further research may evaluate the extensibility of these frameworks to multi-speaker and open-domain diarization tasks, and examine the comparative robustness under highly mismatched input conditions.

References

  • Alvarez-Trejos et al., 27 Nov 2025.
  • Nelson et al., 2011.