Blockwise EDA-EEND for Streaming Diarization
- The paper presents a novel streaming diarization architecture that integrates a causal Transformer encoder with blockwise recurrent LSTM attractor modules for online processing.
- It processes audio in fixed-length blocks using localized left-context attention, ensuring linear-time computation and low latency in meeting and conversational settings.
- Experimental results demonstrate competitive diarization error rates across 1-4 speakers, balancing latency and accuracy compared to offline systems.
Blockwise Encoder–Decoder–Attractor EEND (BW-EDA-EEND) is a neural architecture for streaming end-to-end speaker diarization that supports a variable number of speakers. BW-EDA-EEND processes input incrementally by combining a causal (left-context) Transformer encoder with blockwise recurrent LSTM attractor modules, allowing linear-time computation and low-latency diarization outputs. The system generalizes the original offline EDA-EEND framework by adapting both the embedding computation and attractor inference to operate on short, locally contextualized audio blocks, thereby enabling online diarization for realistic meeting and conversational settings (Han et al., 2020).
1. Encoder–Decoder–Attractor (EDA) Framework
BW-EDA-EEND builds on the EDA-EEND model, whose core architecture comprises:
- Transformer Encoder: Maps an input feature sequence $X = (x_1, \dots, x_T)$ to frame embeddings $E = (e_1, \dots, e_T)$.
- LSTM Encoder: Consumes the embedding sequence $E$ and emits a final state $(h_0, c_0)$.
- LSTM Decoder: Recursively generates a sequence of attractors $a_1, a_2, \dots$, where each $a_s$ functions as a speaker-specific prototype.
- Speaker Activity Scoring: Computes speaker activity $\hat{y}_{t,s} = \sigma(e_t^\top a_s)$, where $\hat{y}_{t,s} \in (0, 1)$ and $\sigma(\cdot)$ denotes the elementwise sigmoid.
- Attractor Existence Probabilities: For each attractor $a_s$, a scalar $p_s = \sigma(w^\top a_s + b)$ determines whether to stop decoding attractors based on a threshold $\tau$.
Attractors are learned end-to-end and serve as fixed points representing speaker identities, so the dot product $e_t^\top a_s$ directly reflects the likelihood of speaker $s$ speaking at frame $t$. This enables joint overlap detection, speaker counting, and diarization without external clustering.
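As a concrete illustration of the attractor mechanism, the following is a minimal PyTorch-style sketch of an EDA head; the class name, module layout, and 256-dimensional sizes are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class EDAHead(nn.Module):
    """Minimal sketch of the EDA attractor machinery (illustrative, not the paper's code)."""
    def __init__(self, d_model=256, max_speakers=4, threshold=0.5):
        super().__init__()
        self.lstm_enc = nn.LSTM(d_model, d_model, batch_first=True)   # consumes frame embeddings
        self.lstm_dec = nn.LSTM(d_model, d_model, batch_first=True)   # emits attractors
        self.exist = nn.Linear(d_model, 1)                            # attractor existence score
        self.max_speakers = max_speakers
        self.threshold = threshold

    def forward(self, emb):                        # emb: (1, T, d_model) frame embeddings
        _, enc_state = self.lstm_enc(emb)          # final encoder state initializes the decoder
        dec_state, attractors = enc_state, []
        zero_in = torch.zeros(1, 1, emb.size(-1))  # decoder is driven by zero inputs
        for _ in range(self.max_speakers + 1):     # decode until the stop attractor
            a, dec_state = self.lstm_dec(zero_in, dec_state)
            p = torch.sigmoid(self.exist(a))       # existence probability p_s
            if p.item() < self.threshold and attractors:
                break                              # stop: no further speakers
            attractors.append(a.squeeze(1))        # keep attractor a_s
        A = torch.cat(attractors, dim=0)                 # (S, d_model)
        y_hat = torch.sigmoid(emb.squeeze(0) @ A.t())    # (T, S) per-frame speaker activities
        return y_hat, A
```

Decoding stops at the first attractor whose existence probability falls below the threshold, which is how the model infers the number of speakers.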
The original EDA-EEND operates in batch mode, requiring access to all frames before producing outputs. BW-EDA-EEND restructures this paradigm for blockwise, streaming deployment.
2. Incremental Blockwise Transformer Encoding
The BW-EDA-EEND model segments the input into consecutive, non-overlapping blocks of $W$ frames each, giving $B = \lceil T / W \rceil$ blocks for an utterance of $T$ frames. For block $b$ and Transformer layer $i$, hidden states $E_b^i$ are computed by:
- Query Construction: $Q_b^i = E_b^{i-1}$, with $E_b^0 = X_b$, the input features of block $b$.
- Key/Value Construction: Keys and values are the concatenation of the previous $L$ blocks' and the current block's hidden states at the previous layer, $K_b^i = V_b^i = \left[ E_{b-L}^{i-1}; \dots; E_{b-1}^{i-1}; E_b^{i-1} \right]$.
- Attention Update: $E_b^i = \mathrm{TransformerLayer}_i\!\left( Q_b^i, K_b^i, V_b^i \right)$.
This left-contextualized attention restricts each block’s attention scope to its recent predecessors and itself, controlling both latency and memory. The full context is approximated in an online fashion analogous to the memory mechanism in Transformer-XL.
The resulting attention computation for each block is $O\!\left((L+1)W^2\right)$, and across the sequence it is $O\!\left(B(L+1)W^2\right) = O\!\left((L+1)WT\right)$ for $B = T/W$ blocks, guaranteeing linear complexity in the input length $T$ for fixed $W$ and $L$. A minimal sketch of this cached, left-context attention follows.
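The sketch below uses torch.nn.MultiheadAttention as a stand-in for one encoder layer; the class name, the cache handling, and the omission of the feed-forward and normalization sublayers are simplifying assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class BlockwiseSelfAttention(nn.Module):
    """One layer of left-context blockwise attention (sketch; feed-forward and norm omitted)."""
    def __init__(self, d_model=256, n_heads=4, left_blocks=1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.left_blocks = left_blocks   # L: number of previous blocks kept as attention context
        self.cache = []                  # layer-(i-1) hidden states of previous blocks

    def forward(self, e_prev):           # e_prev: (1, W, d_model) = E_b^{i-1} for the current block
        # Keys/values: up to L cached previous blocks plus the current block.
        kv = torch.cat(self.cache[-self.left_blocks:] + [e_prev], dim=1)
        out, _ = self.attn(query=e_prev, key=kv, value=kv)   # E_b^i keeps the block length W
        self.cache.append(e_prev.detach())                   # cache this block for future queries
        return out

# Process three 10 s blocks (100 frames each) with one-block left context (L = 1).
layer = BlockwiseSelfAttention(d_model=256, n_heads=4, left_blocks=1)
outputs = [layer(torch.randn(1, 100, 256)) for _ in range(3)]   # each output: (1, 100, 256)
```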
3. Attractor Computation: Unlimited-Latency and Limited-Latency Modes
BW-EDA-EEND provides two core modes, each representing a different latency-accuracy tradeoff:
- Unlimited-Latency (UL): Embeddings are computed blockwise, but all block embeddings are concatenated at the end of the utterance. Attractors are then decoded over the entire sequence through a single LSTM encoder/decoder pass, yielding diarization output only after all audio has been consumed.
- Limited-Latency (LL): At each block $b$, attractors are decoded using an LSTM over only the most recent $L$ blocks' embeddings (i.e., $\left[ E_{b-L+1}; \dots; E_b \right]$). The LSTM encoder state is carried over across blocks, and the inferred number of speakers is constrained to be non-decreasing, so no speaker "disappears" once discovered. Diarization outputs for block $b$ are produced blockwise, with a maximum latency of one block.
Typical experimental settings use 10 s blocks (100 frames at a 100 ms frame shift) and either $L = 1$ (neighbor-block context) or $L = \infty$ (full left context).
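The difference between the two modes can be summarized in a short sketch; `encode_block` and `eda_head` are hypothetical callables standing in for the blockwise encoder and the attractor head (for instance the sketches above), and the carry-over of LSTM state across blocks in LL mode is omitted for brevity:

```python
import torch

def diarize_ul(blocks, encode_block, eda_head):
    """Unlimited latency: concatenate all block embeddings, decode attractors once at the end."""
    all_emb = torch.cat([encode_block(x) for x in blocks], dim=1)   # (1, T, d)
    return eda_head(all_emb)                                        # single output after all audio

def diarize_ll(blocks, encode_block, eda_head, context_blocks=1):
    """Limited latency: decode attractors at every block from the most recent context_blocks blocks."""
    outputs, recent = [], []
    for x in blocks:                                  # at most one block of algorithmic delay
        e_b = encode_block(x)
        recent = (recent + [e_b])[-context_blocks:]   # sliding window over block embeddings
        y_hat, _ = eda_head(torch.cat(recent, dim=1))
        outputs.append(y_hat[-e_b.size(1):])          # keep only the current block's frames
    return outputs
```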
4. Complexity Analysis
Denoting by $T$ the total number of frames, $W$ the block size, $B = T/W$ the number of blocks, and $L$ the left-context window (in blocks), the main computational properties are:
- Time Complexity: Per-block attention costs $O\!\left((L+1)W^2\right)$, so total runtime is $O\!\left((L+1)WT\right)$, i.e., linear in $T$ for fixed $W$ and $L$.
- Space Complexity: Requires $O(LW)$ cached hidden states per layer (those of the previous $L$ blocks), in addition to model parameters.
The blockwise approach enables practical streaming on long audio without quadratically growing compute or memory as in full self-attention.
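As a back-of-the-envelope illustration (hypothetical numbers, counting attention scores only), a 10-minute recording at a 100 ms frame shift shows the gap to full self-attention:

```python
# Attention-score count for T = 6000 frames (10 min at 100 ms), W = 100-frame blocks, L = 1.
T, W, L = 6000, 100, 1
blocks = T // W

blockwise_ops = blocks * W * (L + 1) * W   # each block's queries attend to (L + 1) * W frames
full_attention_ops = T * T                 # offline self-attention attends to all T frames

print(blockwise_ops)                        # 1200000
print(full_attention_ops)                   # 36000000
print(full_attention_ops / blockwise_ops)   # 30.0 -- and the ratio grows linearly with T
```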
5. Objective Functions and Learning
Diarization is formulated as binary prediction per frame and speaker:
- Activity Prediction: $\hat{y}_{t,s} = \sigma(e_t^\top a_s)$ for frame $t$ and speaker $s$.
- Diarization Loss: Permutation-invariant binary cross-entropy against the reference labels, $\mathcal{L}_d = \frac{1}{TS} \min_{\phi \in \Phi(S)} \sum_{t=1}^{T} \mathrm{BCE}\!\left(y_t^{\phi}, \hat{y}_t\right)$, where $\Phi(S)$ is the set of speaker permutations.
- Attractor Existence Loss: $\mathcal{L}_a = \frac{1}{S+1} \sum_{s=1}^{S+1} \mathrm{BCE}\!\left(l_s, p_s\right)$, with targets $l_s = 1$ for $s \le S$ and $l_{S+1} = 0$.
- Total Loss: $\mathcal{L} = \mathcal{L}_d + \alpha \mathcal{L}_a$, typically with $\alpha = 1$.
During training, the stop attractor at $s = S + 1$ is targeted to have existence probability 0, preventing speaker overcounting. Losses are summed over active speakers and frames.
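A compact PyTorch sketch of these loss terms, under the assumption that the maximum speaker count is small enough for an explicit permutation search (function names are illustrative, not the authors' code):

```python
import itertools
import torch
import torch.nn.functional as F

def diarization_loss(y_hat, y):
    """Permutation-invariant BCE. y_hat, y: (T, S) predicted / reference speaker activities."""
    T, S = y.shape
    losses = []
    for perm in itertools.permutations(range(S)):        # explicit search is feasible for S <= 4
        losses.append(F.binary_cross_entropy(y_hat, y[:, list(perm)], reduction='mean'))
    return torch.stack(losses).min()

def attractor_existence_loss(p):
    """p: (S+1,) existence probabilities; targets are 1 for real attractors, 0 for the stop one."""
    labels = torch.ones_like(p)
    labels[-1] = 0.0
    return F.binary_cross_entropy(p, labels, reduction='mean')

def total_loss(y_hat, y, p, alpha=1.0):
    return diarization_loss(y_hat, y) + alpha * attractor_existence_loss(p)
```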
6. Experimental Protocols and Results
The experimental evaluation covers simulated and real data:
- Datasets: Simulated 1–4 speaker mixtures (from Switchboard, SRE, MUSAN noise, RIRs) and CALLHOME English (2–6 speaker calls, with 250 adaptation and test samples).
- Features: 23-dimensional log-Mel filterbanks spliced with ±7 frames of context (23 × 15 = 345 dimensions per frame), subsampled to a 100 ms frame shift; see the sketch after this list.
- Model: 4-layer Transformer encoder (256 attention units, 4 heads); the LSTM encoder/decoder operates in the same 256-dimensional embedding space so that attractors and frame embeddings can be compared by dot product.
- Training schedule: Pretrain on 2-speaker simulated mixtures, finetune on 1–4 speaker simulated mixtures, then adapt on the CALLHOME adaptation set.
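The feature pipeline can be sketched as follows, assuming precomputed 23-dimensional log-Mel frames (a NumPy illustration, not the paper's toolkit):

```python
import numpy as np

def splice_and_subsample(logmel, context=7, subsample=10):
    """Splice +/-context neighboring frames and subsample (sketch of the 23 x 15 = 345-dim input)."""
    T, d = logmel.shape                                   # logmel: (T, 23) log-Mel filterbank frames
    padded = np.pad(logmel, ((context, context), (0, 0)), mode='edge')
    spliced = np.stack([padded[t:t + 2 * context + 1].reshape(-1) for t in range(T)])
    return spliced[::subsample]                           # 100 ms frame shift after subsampling

feats = splice_and_subsample(np.random.randn(1000, 23))   # -> shape (100, 345)
```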
Diarization error rates (DER, %) on the simulated mixtures illustrate the trade-offs of the streaming approach:
| Model | 1 spk | 2 spks | 3 spks | 4 spks |
|---|---|---|---|---|
| Offline x-vector | 37.4% | 7.7% | 11.5% | 22.4% |
| Offline EDA-EEND | 0.27% | 4.18% | 9.66% | 14.2% |
| BW-EDA-EEND-UL (full left context) | 0.28% | 4.22% | 11.2% | 21.0% |
| BW-EDA-EEND-UL (one-block context) | 0.30% | 4.42% | 13.4% | 22.7% |
| BW-EDA-EEND-LL (blockwise, one-block context) | 1.03% | 6.10% | 12.6% | 19.2% |
On the CALLHOME 2-speaker test set, the offline x-vector clustering baseline reaches 15.45% DER and offline EDA-EEND 9.02%; BW-EDA-EEND-UL and BW-EDA-EEND-LL (the latter trained with cross-block speaker-order shuffling) trail the offline EDA-EEND result by modest margins.
These results show small DER degradations (under 2% absolute) relative to offline EDA-EEND for up to two speakers in unlimited-latency mode, and somewhat larger gaps for more than two speakers. With sufficient context, BW-EDA-EEND outperforms classical clustering-based diarization for up to four speakers.
7. Practical Implementation and Limitations
An efficient streaming implementation (LL mode) can be summarized by the following per-block steps:
```python
# Per-utterance streaming loop (LL mode). TransformerLayer_i, LSTMEncoder, LSTMDecoder, Linear,
# and reorder_to_previous denote the trained model components and heuristics; B is the number of
# blocks, L the left context in blocks, S_max the maximum speaker count, threshold the stop value.
layer_cache = [[] for _ in range(n_layers)]   # per-layer hidden states of previous blocks
embedding_cache = []                          # final-layer embeddings, one entry per block
enc_state = zero_state()
prev_attractors, prev_S = None, 0

for b in range(1, B + 1):                     # process the audio block by block
    X_b = extract_block_features(audio, block_index=b)
    H = X_b                                   # E_b^0: input features of block b
    for i in range(n_layers):
        KV = concat(layer_cache[i][-L:] + [H])        # keys/values: previous L blocks + current block
        H_next = TransformerLayer_i(query=H, key=KV, value=KV)
        layer_cache[i].append(H)                      # cache layer-(i-1) states for future blocks
        H = H_next
    E_b = H                                   # frame embeddings of block b
    embedding_cache.append(E_b)

    # Attractor decoding over the most recent L blocks of embeddings; LSTM encoder state carries over.
    I_b = concat(embedding_cache[-L:])
    enc_state = LSTMEncoder(I_b, enc_state)
    dec_state, attractors = enc_state, []
    for s in range(1, S_max + 2):
        a_s, dec_state = LSTMDecoder(input=zeros, state=dec_state)
        p_s = sigmoid(Linear(a_s))                    # attractor existence probability
        if p_s < threshold and s > 1:                 # stop decoding, but keep at least one attractor
            break
        attractors.append(a_s)
    attractors = stack(attractors)                    # (num_decoded, d)

    # Cross-block consistency: never reduce the speaker count, align and smooth attractors.
    S_b = max(prev_S, len(attractors))
    attractors = reorder_to_previous(attractors, prev_attractors)  # cosine-similarity reordering
    if prev_attractors is not None:
        attractors = 0.5 * (attractors + prev_attractors)          # average aligned attractors
    Yhat_b = sigmoid(E_b @ transpose(attractors[:S_b]))            # per-frame speaker activities

    prev_attractors, prev_S = attractors, S_b
```
Heuristics for attractor-to-speaker assignment across blocks (cosine-similarity reordering, attractor averaging, and speaker-order shuffling during training) are critical to prevent catastrophic error-rate increases.
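One such heuristic, greedy cosine-similarity matching of new attractors to the previous block's attractors followed by averaging, might look like the following sketch (the function and its details are illustrative assumptions, not the paper's exact procedure):

```python
import torch
import torch.nn.functional as F

def match_and_smooth(attractors, prev_attractors):
    """Greedily align new attractors to the previous block's by cosine similarity, then average.
    attractors: (S_new, d); prev_attractors: (S_prev, d) or None for the first block."""
    if prev_attractors is None:
        return attractors
    # Pairwise cosine similarity between previous and new attractors: (S_prev, S_new).
    sim = F.cosine_similarity(prev_attractors.unsqueeze(1), attractors.unsqueeze(0), dim=-1)
    order, used = [], set()
    for i in range(prev_attractors.size(0)):          # keep previously discovered speakers first
        j = max((j for j in range(attractors.size(0)) if j not in used),
                key=lambda j: sim[i, j].item(), default=None)
        if j is None:
            break
        order.append(j)
        used.add(j)
    order += [j for j in range(attractors.size(0)) if j not in used]   # append newly found speakers
    reordered = attractors[order]
    n = min(prev_attractors.size(0), reordered.size(0))
    smoothed = reordered.clone()
    smoothed[:n] = 0.5 * (reordered[:n] + prev_attractors[:n])         # average with previous block
    return smoothed
```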
Trade-offs include the accuracy-latency relationship (UL mode is more accurate but has higher latency, while LL mode offers near-realtime output at the cost of a 2–3% absolute DER increase) and the context-complexity relationship (a larger left context $L$ yields better accuracy but higher per-block computation and memory).
Open challenges include long-session attractor consistency (“speaker ID drift”), improved realistic simulation (e.g., moving to LibriCSS-style mixtures), extension to far-field data, reducing latency below the block level, and developing end-to-end learned heuristics for attractor management and attention parameterization.
BW-EDA-EEND establishes a practical framework for streaming neural diarization with competitive accuracy and algorithmic scalability on variable numbers of speakers (Han et al., 2020).