Blockwise EDA-EEND for Streaming Diarization

  • The paper presents a novel streaming diarization architecture that integrates a causal Transformer encoder with blockwise recurrent LSTM attractor modules for online processing.
  • It processes audio in fixed-length blocks using localized left-context attention, ensuring linear-time computation and low latency in meeting and conversational settings.
  • Experimental results demonstrate competitive diarization error rates across 1-4 speakers, balancing latency and accuracy compared to offline systems.

Blockwise Encoder–Decoder–Attractor EEND (BW-EDA-EEND) is a neural architecture for streaming end-to-end speaker diarization that supports a variable number of speakers. BW-EDA-EEND processes input incrementally by combining a causal (left-context) Transformer encoder with blockwise recurrent LSTM attractor modules, allowing linear-time computation and low-latency diarization outputs. The system generalizes the original offline EDA-EEND framework by adapting both the embedding computation and attractor inference to operate on short, locally contextualized audio blocks, thereby enabling online diarization for realistic meeting and conversational settings (Han et al., 2020).

1. Encoder–Decoder–Attractor (EDA) Framework

BW-EDA-EEND builds on the EDA-EEND model, whose core architecture comprises:

  • Transformer Encoder: Maps an input feature sequence $X=[x_1,\dots,x_T]\in\mathbb{R}^{T\times F}$ to frame embeddings $E=[e_1,\dots,e_T]\in\mathbb{R}^{T\times D}$.
  • LSTM Encoder: Consumes $E$ and emits a final state $(h_0, c_0)$ that initializes the decoder.
  • LSTM Decoder: Recursively generates a sequence of attractors $\{a_1,\dots,a_{S+1}\}\subset\mathbb{R}^{D}$, where each $a_s$ functions as a speaker-specific prototype.
  • Speaker Activity Scoring: Computes speaker activities $\hat{Y}=\sigma(EA^{\top})\in(0,1)^{T\times S}$, where $A=[a_1,\dots,a_S]\in\mathbb{R}^{S\times D}$ and $\sigma$ denotes the elementwise sigmoid.
  • Attractor Existence Probabilities: For each attractor $a_s$, a scalar $p_s=\sigma(\mathrm{Linear}(a_s))$ determines whether to stop decoding attractors, based on a threshold $\tau$.

Attractors are learned end-to-end and serve as fixed points representing speaker identities, so the dot product $e_t\cdot a_s$ directly reflects the likelihood of speaker $s$ speaking at frame $t$. This enables joint overlap detection, speaker counting, and diarization without external clustering.
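
As a concrete illustration of the scoring and stopping steps, here is a minimal NumPy sketch; the shapes and random values are placeholders, not taken from the paper's implementation:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

T, D, S_max = 500, 256, 4            # frames, embedding dim, max attractors to decode (illustrative)
E = np.random.randn(T, D)            # frame embeddings from the Transformer encoder
A = np.random.randn(S_max, D)        # decoded attractors (would come from the LSTM decoder)
w, b = np.random.randn(D), 0.0       # linear layer for attractor existence

p = sigmoid(A @ w + b)               # existence probability p_s for each attractor
tau = 0.5
S = int(np.sum(np.cumprod(p >= tau)))  # keep leading attractors until the first p_s falls below tau

Y_hat = sigmoid(E @ A[:S].T)         # (T, S) per-frame speaker activities
decisions = Y_hat > 0.5              # binary diarization output; overlapping speech is allowed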

The original EDA-EEND operates in batch mode, requiring access to all frames before producing outputs. BW-EDA-EEND restructures this paradigm for blockwise, streaming deployment.

2. Incremental Blockwise Transformer Encoding

The BW-EDA-EEND model segments the input into $B$ consecutive, non-overlapping blocks of $W$ frames each, $X = [X_1, \dots, X_B]$ with $X_b\in\mathbb{R}^{W\times F}$. For block $b$ and Transformer layer $i$, hidden states $E_b^i\in\mathbb{R}^{W\times D}$ are computed by:

  • Query Construction: $Q_b^i = E_b^{i-1}$, with $E_b^0 = X_b$.
  • Key/Value Construction: Keys and values are the concatenation of the current and previous $L$ blocks' hidden states at the previous layer,

$$K_b^i = V_b^i = \mathrm{Concat}\left(E_{b-L:b-1}^{i-1},\, E_b^{i-1}\right).$$

  • Attention Update: $E_b^i = \mathrm{TransformerEncoderLayer}(Q_b^i, K_b^i, V_b^i)$.

This left-contextualized attention restricts each block's attention scope to itself and its $L$ most recent predecessors, controlling both latency and memory. The full context is approximated in an online fashion, analogous to the memory mechanism in Transformer-XL.
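
A minimal single-head sketch of this left-context attention follows, assuming plain scaled dot-product attention and omitting the multi-head projections, feed-forward sublayers, and layer normalization of a full Transformer layer:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def blockwise_attention(E_in, cache, Wq, Wk, Wv, L):
    # E_in: (W, D) layer-(i-1) states of the current block (the queries).
    # cache: list of (W, D) layer-(i-1) states of earlier blocks; only the last L are attended to.
    context = np.concatenate(cache[-L:] + [E_in], axis=0)   # keys/values span at most (L+1)*W frames
    Q, K, V = E_in @ Wq, context @ Wk, context @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[1]))
    return attn @ V                                         # (W, D) updated states for the current block

W, D, L = 100, 256, 1
Wq, Wk, Wv = (np.random.randn(D, D) * 0.05 for _ in range(3))
cache = []
for b in range(5):                                          # stream over blocks
    E_in = np.random.randn(W, D)                            # stand-in for layer-(i-1) states of block b
    E_out = blockwise_attention(E_in, cache, Wq, Wk, Wv, L)
    cache = (cache + [E_in])[-L:]                           # retain only the L most recent blocks

In a full model, each encoder layer keeps its own such cache, mirroring the Transformer-XL-style memory mentioned above.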

The resulting computation for each block is $O((L+1)W^2)$, and across the sequence is $O(TW)$ for $B=T/W$ blocks, guaranteeing linear complexity in the input length $T$ for fixed $L$ and $W$.

3. Attractor Computation: Unlimited-Latency and Limited-Latency Modes

BW-EDA-EEND provides two core modes, each representing a different latency-accuracy tradeoff:

  • Unlimited-Latency (UL): Embeddings are computed blockwise, but all blocks are concatenated at the end. Attractors are decoded over the entire utterance through a single LSTMEncoder and LSTMDecoder pass, yielding diarization output only after all audio has been consumed.
  • Limited-Latency (LL): At each block $b$, attractors $A_b$ are decoded using an LSTM over only the most recent $L+1$ blocks' embeddings, $I_b = \mathrm{Concat}(E_{b-L},\dots,E_b)$. The decoder state is carried over across blocks, ensuring the number of speakers is non-decreasing: no speaker "disappears" once discovered. Diarization outputs for block $b$ are produced blockwise, with a maximum latency of one block.

Typical experimental settings use $W=10$ s blocks (100 frames at a 100 ms frame shift) and either $L=1$ (one neighboring block of left context) or $L=\infty$ (full left context).
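
Under these settings, the context and latency numbers work out as in the short helper below (a worked example under the stated assumptions, not code from the paper):

frame_shift_s = 0.1                     # 100 ms per subsampled frame
block_s, L = 10.0, 1                    # 10 s blocks, one block of left context
W = int(block_s / frame_shift_s)        # 100 frames per block
context = (L + 1) * W                   # 200 frames visible to each attention query
print(f"W = {W} frames, attention context = {context} frames ({(L + 1) * block_s:.0f} s), "
      f"max LL-mode output latency ~ {block_s:.0f} s")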

4. Complexity Analysis

Denoting $T$ the total number of frames, $W$ the block size, $B=T/W$ the number of blocks, and $L$ the context window, the main computational properties are:

  • Time Complexity: Per-block attention costs $O((L+1)W^2)$, so the total runtime is $O(B(L+1)W^2) = O((L+1)WT)$, i.e., linear in $T$ for fixed $L$ and $W$.
  • Space Complexity: Caching hidden states from the previous $L$ blocks requires $O(LWD)$ storage, in addition to the model parameters.

The blockwise approach enables practical streaming on long audio without quadratically growing compute or memory as in full self-attention.
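
For intuition, the following back-of-the-envelope comparison counts attention-score computations for a one-hour recording under the typical block settings (proportional counts only, my own illustration):

T, W, L = 36000, 100, 1                  # one hour at a 100 ms frame shift; 10 s blocks; L = 1
full_attention = T ** 2                  # pairwise scores over the entire recording
blockwise = (T // W) * (L + 1) * W ** 2  # O((L+1) W^2) per block, summed over B = T / W blocks
print(f"full: {full_attention:.1e}, blockwise: {blockwise:.1e}, "
      f"reduction ~ {full_attention / blockwise:.0f}x")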

5. Objective Functions and Learning

Diarization is formulated as binary prediction per frame and speaker:

  • Activity Prediction:

$$\hat{Y}_{t,s} = \sigma(e_t \cdot a_s)$$

  • Diarization Loss:

$$\mathcal{L}_\text{diar} = -\sum_{t=1}^{T}\sum_{s=1}^{S} \big[\, y_{t,s}\log \hat{y}_{t,s} + (1 - y_{t,s})\log(1 - \hat{y}_{t,s}) \,\big]$$

  • Attractor Existence Loss:

For $p_s = \sigma(w^\top a_s + b)$:

$$\mathcal{L}_\text{count} = -\sum_{s} \big[\, t_s \log p_s + (1 - t_s)\log(1 - p_s) \,\big]$$

  • Total Loss:

$\mathcal{L} = \mathcal{L}_\text{diar} + \alpha\,\mathcal{L}_\text{count}$, typically with $\alpha = 1$.

During training, the stop attractor at $s = S+1$ is assigned an existence target of 0, preventing overcounting of speakers. Losses are summed over active speakers and frames.
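
The losses above translate directly into code; a NumPy sketch with random placeholders is shown below (the permutation-invariant label assignment used when training EEND-style models is omitted for brevity):

import numpy as np

def bce(p, y, eps=1e-7):
    return -(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))

T, S = 500, 2
y = np.random.randint(0, 2, size=(T, S)).astype(float)   # reference speaker activities
y_hat = np.random.rand(T, S)                              # model outputs sigmoid(e_t . a_s)
p = np.random.rand(S + 1)                                 # attractor existence probabilities

t = np.concatenate([np.ones(S), np.zeros(1)])             # targets: 1 for real attractors, 0 for the stop attractor
loss_diar = bce(y_hat, y).sum()                           # summed over frames and speakers
loss_count = bce(p, t).sum()
alpha = 1.0
loss = loss_diar + alpha * loss_count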

6. Experimental Protocols and Results

The experimental evaluation covers simulated and real data:

  • Datasets: Simulated 1–4 speaker mixtures (built from Switchboard and SRE recordings with MUSAN noise and room impulse responses) and CALLHOME English (2–6 speaker calls, with 250 adaptation and 250 test samples).
  • Features: 23-dimensional log-Mel filterbanks with a $\pm 7$ frame context window, subsampled to a 100 ms frame shift (yielding 345-dimensional inputs per frame); see the sketch after this list.
  • Model: 4-layer Transformer encoder (256 attention units, 4 heads) and LSTM encoder/decoder with $D=256$.
  • Training schedule: Pretrain on 2-speaker simulated mixtures, finetune on 1–4 speaker simulated mixtures, then adapt on the CALLHOME adaptation set.
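
A sketch of the feature pipeline, assuming a 10 ms base frame shift for the log-Mel features (the exact front-end parameters are an assumption here):

import numpy as np

def splice_and_subsample(logmel, context=7, subsample=10):
    # logmel: (num_frames, 23) log-Mel features at a 10 ms frame shift.
    # Returns (num_frames // subsample, 23 * (2*context + 1)) = (..., 345) features at 100 ms.
    num_frames, _ = logmel.shape
    padded = np.pad(logmel, ((context, context), (0, 0)), mode="edge")
    spliced = np.concatenate(
        [padded[i:i + num_frames] for i in range(2 * context + 1)], axis=1)
    return spliced[::subsample]

feats = splice_and_subsample(np.random.randn(1000, 23))
print(feats.shape)                                        # (100, 345)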

Diarization error rates (DER) reflect the trade-offs of streaming approaches:

| Model | 1 spk | 2 spks | 3 spks | 4 spks |
|---|---|---|---|---|
| Offline x-vector | 37.4% | 7.7% | 11.5% | 22.4% |
| Offline EDA-EEND | 0.27% | 4.18% | 9.66% | 14.2% |
| BW-EDA-EEND-UL ($L=\infty$) | 0.28% | 4.22% | 11.2% | 21.0% |
| BW-EDA-EEND-UL ($L=1$) | 0.30% | 4.42% | 13.4% | 22.7% |
| BW-EDA-EEND-LL (blockwise, $L=\infty$) | 1.03% | 6.10% | 12.6% | 19.2% |

On the CALLHOME 2-speaker test set, DERs are: offline x-vector 15.45%, offline EDA-EEND 9.02%, BW-EDA-EEND-UL 10.05%, and BW-EDA-EEND-LL 11.82% (with cross-block shuffling).

These results show small DER degradations (≤2% absolute) relative to offline EDA-EEND for up to two speakers in unlimited-latency mode, and somewhat larger gaps for more than two speakers. With sufficient context, BW-EDA-EEND outperforms classical clustering for up to four speakers.

7. Practical Implementation and Limitations

Efficient blockwise streaming can be implemented with the following per-block steps, outlined here as pseudocode for LL mode:

initialize kv_cache[i] = [] for i in 0 .. n_layers-1    # layer-i hidden states of previous blocks
initialize emb_cache = []                               # final-layer embeddings of previous blocks
initialize enc_state = zero_state()                     # LSTM attractor-encoder state, carried across blocks
prev_attractors = [];  prev_S = 0

for b in 1..B:                                          # process audio block by block
  X_b = extract_block_features(audio, block_index=b)    # W frames of input features
  E^0 = X_b
  for i in 1..n_layers:
    KV = concat(kv_cache[i-1][-L:], E^{i-1})            # keys/values: previous L blocks plus current block
    E^i = TransformerLayer_i(query=E^{i-1}, keys=KV, values=KV)
    append(kv_cache[i-1], E^{i-1});  keep only the last L entries
  E_b = E^{n_layers}                                    # frame embeddings for block b
  append(emb_cache, E_b);  keep only the last L+1 entries

  I_b = concat(emb_cache)                               # attractor input: embeddings of the last L+1 blocks
  enc_state = LSTMEncoder(I_b, enc_state)
  dec_state = enc_state
  attractors_b = []
  for s in 1 .. S_max+1:
    a_s, dec_state = LSTMDecoder(input=zeros, state=dec_state)
    p_s = sigmoid(Linear(a_s))
    if p_s < threshold and s > 1: break                 # attractor-existence stopping criterion
    append(attractors_b, a_s)

  S_b = max(prev_S, len(attractors_b))                  # speaker count never decreases across blocks
  attractors_b = reorder_to_previous(attractors_b, prev_attractors)   # cosine-similarity matching
  attractors_b = average_with_previous(attractors_b, prev_attractors) # smooth recurring speakers' attractors
  Yhat_b = sigmoid(E_b @ attractors_b[:S_b]^T)          # per-frame speaker activities for block b

  prev_attractors = attractors_b;  prev_S = S_b

Heuristics for attractor-to-speaker assignment across blocks (cosine reordering, averaging, speaker-order shuffling) are critical to prevent catastrophic error rate increases.
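
As an illustration, the reorder_to_previous and average_with_previous steps in the outline above could be combined roughly as in the following NumPy sketch (my own construction, assuming the current block decodes at least as many attractors as the previous one; the paper does not publish this exact routine):

import numpy as np

def align_attractors(curr, prev):
    # Greedily match current-block attractors to the previous block's order by cosine
    # similarity, average matched pairs, and append any newly appeared attractors.
    if prev is None or len(prev) == 0:
        return curr
    cn = curr / np.linalg.norm(curr, axis=1, keepdims=True)
    pn = prev / np.linalg.norm(prev, axis=1, keepdims=True)
    sim = pn @ cn.T                                  # (S_prev, S_curr) cosine similarities
    used, aligned = set(), []
    for j in range(len(prev)):                       # keep the previous speaker order
        c = next(k for k in np.argsort(-sim[j]) if k not in used)
        used.add(c)
        aligned.append(0.5 * (curr[c] + prev[j]))    # smooth the attractor across blocks
    aligned += [curr[k] for k in range(len(curr)) if k not in used]   # newly discovered speakers
    return np.stack(aligned)

prev = np.random.randn(2, 256)                       # attractors carried over from block b-1
curr = np.random.randn(3, 256)                       # attractors decoded at block b
print(align_attractors(curr, prev).shape)            # (3, 256)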

Trade-offs include the accuracy-latency relationship (UL mode is more accurate but has higher latency; LL mode offers near-real-time output at the cost of DER increases of 2–3% absolute) and the complexity-context relationship (a larger context $L$ yields better accuracy but higher per-block compute and memory).

Open challenges include maintaining attractor consistency over long sessions ("speaker ID drift"), more realistic simulation (e.g., moving to LibriCSS-style mixtures), extension to far-field data, reducing latency below the block level, and replacing hand-crafted attractor-management and attention-parameterization heuristics with end-to-end learned mechanisms.

BW-EDA-EEND establishes a practical framework for streaming neural diarization with competitive accuracy and algorithmic scalability on variable numbers of speakers (Han et al., 2020).

References

Han, E., Lee, C., and Stolcke, A. (2020). BW-EDA-EEND: Streaming End-to-End Neural Speaker Diarization for a Variable Number of Speakers. arXiv:2011.02678.
