Blockwise EDA-EEND for Streaming Diarization
- The paper presents a novel streaming diarization architecture that integrates a causal Transformer encoder with blockwise recurrent LSTM attractor modules for online processing.
- It processes audio in fixed-length blocks using localized left-context attention, ensuring linear-time computation and low latency in meeting and conversational settings.
- Experimental results demonstrate competitive diarization error rates across 1-4 speakers, balancing latency and accuracy compared to offline systems.
Blockwise Encoder–Decoder–Attractor EEND (BW-EDA-EEND) is a neural architecture for streaming end-to-end speaker diarization that supports a variable number of speakers. BW-EDA-EEND processes input incrementally by combining a causal (left-context) Transformer encoder with blockwise recurrent LSTM attractor modules, allowing linear-time computation and low-latency diarization outputs. The system generalizes the original offline EDA-EEND framework by adapting both the embedding computation and attractor inference to operate on short, locally contextualized audio blocks, thereby enabling online diarization for realistic meeting and conversational settings (Han et al., 2020).
1. Encoder–Decoder–Attractor (EDA) Framework
BW-EDA-EEND builds on the EDA-EEND model, whose core architecture comprises:
- Transformer Encoder: Maps an input feature sequence $X = (x_1, \dots, x_T)$ to frame embeddings $E = (e_1, \dots, e_T)$.
- LSTM Encoder: Consumes the embedding sequence $E$ and emits a final state $(h_0, c_0)$.
- LSTM Decoder: Recursively generates a sequence of attractors $a_1, a_2, \dots$, where each $a_s$ functions as a speaker-specific prototype.
- Speaker Activity Scoring: Computes speaker activity $\hat{y}_{t,s} = \sigma(e_t^\top a_s)$, where $\hat{y}_{t,s} \in (0, 1)$ and $\sigma(\cdot)$ denotes the elementwise sigmoid.
- Attractor Existence Probabilities: For each attractor $a_s$, a scalar $p_s = \sigma(w^\top a_s + b)$ determines whether to stop decoding attractors based on a threshold $\tau$.
Attractors are learned end-to-end and serve as fixed points representing speaker identities, so the dot product $e_t^\top a_s$ directly reflects the likelihood of speaker $s$ speaking at frame $t$. This enables joint overlap detection, speaker counting, and diarization without external clustering.
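As a concrete illustration of the attractor mechanism, the following is a minimal PyTorch-style sketch of an EDA head; the class name, module layout, and 256-dimensional sizes are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class EDAHead(nn.Module):
    """Minimal sketch of the EDA attractor machinery (illustrative, not the paper's code)."""
    def __init__(self, d_model=256, max_speakers=4, threshold=0.5):
        super().__init__()
        self.lstm_enc = nn.LSTM(d_model, d_model, batch_first=True)   # consumes frame embeddings
        self.lstm_dec = nn.LSTM(d_model, d_model, batch_first=True)   # emits attractors
        self.exist = nn.Linear(d_model, 1)                            # attractor existence score
        self.max_speakers = max_speakers
        self.threshold = threshold

    def forward(self, emb):                        # emb: (1, T, d_model) frame embeddings
        _, enc_state = self.lstm_enc(emb)          # final encoder state initializes the decoder
        dec_state, attractors = enc_state, []
        zero_in = torch.zeros(1, 1, emb.size(-1))  # decoder is driven by zero inputs
        for _ in range(self.max_speakers + 1):     # decode until the stop attractor
            a, dec_state = self.lstm_dec(zero_in, dec_state)
            p = torch.sigmoid(self.exist(a))       # existence probability p_s
            if p.item() < self.threshold and attractors:
                break                              # stop: no further speakers
            attractors.append(a.squeeze(1))        # keep attractor a_s
        A = torch.cat(attractors, dim=0)                 # (S, d_model)
        y_hat = torch.sigmoid(emb.squeeze(0) @ A.t())    # (T, S) per-frame speaker activities
        return y_hat, A
```

Decoding stops at the first attractor whose existence probability falls below the threshold, which is how the model infers the number of speakers.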
The original EDA-EEND operates in batch mode, requiring access to all frames before producing outputs. BW-EDA-EEND restructures this paradigm for blockwise, streaming deployment.
2. Incremental Blockwise Transformer Encoding
The BW-EDA-EEND model segments the input into consecutive, non-overlapping blocks of $W$ frames each, giving $B = \lceil T / W \rceil$ blocks for an utterance of $T$ frames. For block $b$ and Transformer layer $i$, hidden states $E_b^i$ are computed by:
- Query Construction: $Q_b^i = E_b^{i-1}$, with $E_b^0 = X_b$, the input features of block $b$.
- Key/Value Construction: Keys and values are the concatenation of the previous $L$ blocks' and the current block's hidden states at the previous layer, $K_b^i = V_b^i = \left[ E_{b-L}^{i-1}; \dots; E_{b-1}^{i-1}; E_b^{i-1} \right]$.
- Attention Update: $E_b^i = \mathrm{TransformerLayer}_i\!\left( Q_b^i, K_b^i, V_b^i \right)$.
This left-contextualized attention restricts each block’s attention scope to its recent predecessors and itself, controlling both latency and memory. The full context is approximated in an online fashion analogous to the memory mechanism in Transformer-XL.
The resulting attention computation for each block is $O\!\left((L+1)W^2\right)$, and across the sequence it is $O\!\left(B(L+1)W^2\right) = O\!\left((L+1)WT\right)$ for $B = T/W$ blocks, guaranteeing linear complexity in the input length $T$ for fixed $W$ and $L$. A minimal sketch of this cached, left-context attention follows.
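The sketch below uses torch.nn.MultiheadAttention as a stand-in for one encoder layer; the class name, the cache handling, and the omission of the feed-forward and normalization sublayers are simplifying assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class BlockwiseSelfAttention(nn.Module):
    """One layer of left-context blockwise attention (sketch; feed-forward and norm omitted)."""
    def __init__(self, d_model=256, n_heads=4, left_blocks=1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.left_blocks = left_blocks   # L: number of previous blocks kept as attention context
        self.cache = []                  # layer-(i-1) hidden states of previous blocks

    def forward(self, e_prev):           # e_prev: (1, W, d_model) = E_b^{i-1} for the current block
        # Keys/values: up to L cached previous blocks plus the current block.
        kv = torch.cat(self.cache[-self.left_blocks:] + [e_prev], dim=1)
        out, _ = self.attn(query=e_prev, key=kv, value=kv)   # E_b^i keeps the block length W
        self.cache.append(e_prev.detach())                   # cache this block for future queries
        return out

# Process three 10 s blocks (100 frames each) with one-block left context (L = 1).
layer = BlockwiseSelfAttention(d_model=256, n_heads=4, left_blocks=1)
outputs = [layer(torch.randn(1, 100, 256)) for _ in range(3)]   # each output: (1, 100, 256)
```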
3. Attractor Computation: Unlimited-Latency and Limited-Latency Modes
BW-EDA-EEND provides two core modes, each representing a different latency-accuracy tradeoff:
- Unlimited-Latency (UL): Embeddings are computed blockwise, but all block embeddings are concatenated at the end of the utterance. Attractors are then decoded over the entire sequence through a single LSTM encoder/decoder pass, yielding diarization output only after all audio has been consumed.
- Limited-Latency (LL): At each block $b$, attractors are decoded using an LSTM over only the most recent $L$ blocks' embeddings (i.e., $\left[ E_{b-L+1}; \dots; E_b \right]$). The LSTM encoder state is carried over across blocks, and the inferred number of speakers is constrained to be non-decreasing, so no speaker "disappears" once discovered. Diarization outputs for block $b$ are produced blockwise, with a maximum latency of one block.
Typical experimental settings use 10 s blocks (100 frames at a 100 ms frame shift) and either $L = 1$ (neighbor-block context) or $L = \infty$ (full left context).
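The difference between the two modes can be summarized in a short sketch; `encode_block` and `eda_head` are hypothetical callables standing in for the blockwise encoder and the attractor head (for instance the sketches above), and the carry-over of LSTM state across blocks in LL mode is omitted for brevity:

```python
import torch

def diarize_ul(blocks, encode_block, eda_head):
    """Unlimited latency: concatenate all block embeddings, decode attractors once at the end."""
    all_emb = torch.cat([encode_block(x) for x in blocks], dim=1)   # (1, T, d)
    return eda_head(all_emb)                                        # single output after all audio

def diarize_ll(blocks, encode_block, eda_head, context_blocks=1):
    """Limited latency: decode attractors at every block from the most recent context_blocks blocks."""
    outputs, recent = [], []
    for x in blocks:                                  # at most one block of algorithmic delay
        e_b = encode_block(x)
        recent = (recent + [e_b])[-context_blocks:]   # sliding window over block embeddings
        y_hat, _ = eda_head(torch.cat(recent, dim=1))
        outputs.append(y_hat[-e_b.size(1):])          # keep only the current block's frames
    return outputs
```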
4. Complexity Analysis
Denoting by $T$ the total number of frames, $W$ the block size, $B = T/W$ the number of blocks, and $L$ the left-context window (in blocks), the main computational properties are:
- Time Complexity: Per-block attention costs $O\!\left((L+1)W^2\right)$, so total runtime is $O\!\left((L+1)WT\right)$, i.e., linear in $T$ for fixed $W$ and $L$.
- Space Complexity: Requires $O(LW)$ cached hidden states per layer (those of the previous $L$ blocks), in addition to model parameters.
The blockwise approach enables practical streaming on long audio without quadratically growing compute or memory as in full self-attention.
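As a back-of-the-envelope illustration (hypothetical numbers, counting attention scores only), a 10-minute recording at a 100 ms frame shift shows the gap to full self-attention:

```python
# Attention-score count for T = 6000 frames (10 min at 100 ms), W = 100-frame blocks, L = 1.
T, W, L = 6000, 100, 1
blocks = T // W

blockwise_ops = blocks * W * (L + 1) * W   # each block's queries attend to (L + 1) * W frames
full_attention_ops = T * T                 # offline self-attention attends to all T frames

print(blockwise_ops)                        # 1200000
print(full_attention_ops)                   # 36000000
print(full_attention_ops / blockwise_ops)   # 30.0 -- and the ratio grows linearly with T
```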
5. Objective Functions and Learning
Diarization is formulated as binary prediction per frame and speaker:
- Activity Prediction: $\hat{y}_{t,s} = \sigma(e_t^\top a_s)$ for frame $t$ and speaker $s$.
- Diarization Loss: Permutation-invariant binary cross-entropy against the reference labels, $\mathcal{L}_d = \frac{1}{TS} \min_{\phi \in \Phi(S)} \sum_{t=1}^{T} \mathrm{BCE}\!\left(y_t^{\phi}, \hat{y}_t\right)$, where $\Phi(S)$ is the set of speaker permutations.
- Attractor Existence Loss: $\mathcal{L}_a = \frac{1}{S+1} \sum_{s=1}^{S+1} \mathrm{BCE}\!\left(l_s, p_s\right)$, with targets $l_s = 1$ for $s \le S$ and $l_{S+1} = 0$.
- Total Loss: $\mathcal{L} = \mathcal{L}_d + \alpha \mathcal{L}_a$, typically with $\alpha = 1$.
During training, the stop attractor at $s = S + 1$ is targeted to have existence probability 0, preventing speaker overcounting. Losses are summed over active speakers and frames.
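A compact PyTorch sketch of these loss terms, under the assumption that the maximum speaker count is small enough for an explicit permutation search (function names are illustrative, not the authors' code):

```python
import itertools
import torch
import torch.nn.functional as F

def diarization_loss(y_hat, y):
    """Permutation-invariant BCE. y_hat, y: (T, S) predicted / reference speaker activities."""
    T, S = y.shape
    losses = []
    for perm in itertools.permutations(range(S)):        # explicit search is feasible for S <= 4
        losses.append(F.binary_cross_entropy(y_hat, y[:, list(perm)], reduction='mean'))
    return torch.stack(losses).min()

def attractor_existence_loss(p):
    """p: (S+1,) existence probabilities; targets are 1 for real attractors, 0 for the stop one."""
    labels = torch.ones_like(p)
    labels[-1] = 0.0
    return F.binary_cross_entropy(p, labels, reduction='mean')

def total_loss(y_hat, y, p, alpha=1.0):
    return diarization_loss(y_hat, y) + alpha * attractor_existence_loss(p)
```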
6. Experimental Protocols and Results
The experimental evaluation covers simulated and real data:
- Datasets: Simulated 1–4 speaker mixtures (from Switchboard, SRE, MUSAN noise, RIRs) and CALLHOME English (2–6 speaker calls, with 250 adaptation and test samples).
- Features: 23-dimensional log-Mel filterbanks spliced with ±7 frames of context (23 × 15 = 345 dimensions per frame), subsampled to a 100 ms frame shift; see the sketch after this list.
- Model: 4-layer Transformer encoder (256 attention units, 4 heads); the LSTM encoder/decoder operates in the same 256-dimensional embedding space so that attractors and frame embeddings can be compared by dot product.
- Training schedule: Pretrain on 2-speaker simulated mixtures, finetune on 1–4 speaker simulated mixtures, then adapt on the CALLHOME adaptation set.
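The feature pipeline can be sketched as follows, assuming precomputed 23-dimensional log-Mel frames (a NumPy illustration, not the paper's toolkit):

```python
import numpy as np

def splice_and_subsample(logmel, context=7, subsample=10):
    """Splice +/-context neighboring frames and subsample (sketch of the 23 x 15 = 345-dim input)."""
    T, d = logmel.shape                                   # logmel: (T, 23) log-Mel filterbank frames
    padded = np.pad(logmel, ((context, context), (0, 0)), mode='edge')
    spliced = np.stack([padded[t:t + 2 * context + 1].reshape(-1) for t in range(T)])
    return spliced[::subsample]                           # 100 ms frame shift after subsampling

feats = splice_and_subsample(np.random.randn(1000, 23))   # -> shape (100, 345)
```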
Diarization error rates (DER, %) on the simulated mixtures illustrate the trade-offs of the streaming approach:
| Model | 1 spk | 2 spks | 3 spks | 4 spks |
|---|---|---|---|---|
| Offline x-vector | 37.4% | 7.7% | 11.5% | 22.4% |
| Offline EDA-EEND | 0.27% | 4.18% | 9.66% | 14.2% |
| BW-EDA-EEND-UL (full left context) | 0.28% | 4.22% | 11.2% | 21.0% |
| BW-EDA-EEND-UL (one-block context) | 0.30% | 4.42% | 13.4% | 22.7% |
| BW-EDA-EEND-LL (blockwise, one-block context) | 1.03% | 6.10% | 12.6% | 19.2% |
On the CALLHOME 2-speaker test set, the offline x-vector clustering baseline reaches 15.45% DER and offline EDA-EEND 9.02%; BW-EDA-EEND-UL and BW-EDA-EEND-LL (the latter trained with cross-block speaker-order shuffling) trail the offline EDA-EEND result by modest margins.
These results show small DER degradations (under 2% absolute) relative to offline EDA-EEND for up to two speakers in unlimited-latency mode, and somewhat larger gaps for more than two speakers. With sufficient context, BW-EDA-EEND outperforms classical clustering-based diarization for up to four speakers.
7. Practical Implementation and Limitations
An efficient streaming implementation (LL mode) can be summarized by the following per-block steps:
```python
# Per-utterance streaming loop (LL mode). TransformerLayer_i, LSTMEncoder, LSTMDecoder, Linear,
# and reorder_to_previous denote the trained model components and heuristics; B is the number of
# blocks, L the left context in blocks, S_max the maximum speaker count, threshold the stop value.
layer_cache = [[] for _ in range(n_layers)]   # per-layer hidden states of previous blocks
embedding_cache = []                          # final-layer embeddings, one entry per block
enc_state = zero_state()
prev_attractors, prev_S = None, 0

for b in range(1, B + 1):                     # process the audio block by block
    X_b = extract_block_features(audio, block_index=b)
    H = X_b                                   # E_b^0: input features of block b
    for i in range(n_layers):
        KV = concat(layer_cache[i][-L:] + [H])        # keys/values: previous L blocks + current block
        H_next = TransformerLayer_i(query=H, key=KV, value=KV)
        layer_cache[i].append(H)                      # cache layer-(i-1) states for future blocks
        H = H_next
    E_b = H                                   # frame embeddings of block b
    embedding_cache.append(E_b)

    # Attractor decoding over the most recent L blocks of embeddings; LSTM encoder state carries over.
    I_b = concat(embedding_cache[-L:])
    enc_state = LSTMEncoder(I_b, enc_state)
    dec_state, attractors = enc_state, []
    for s in range(1, S_max + 2):
        a_s, dec_state = LSTMDecoder(input=zeros, state=dec_state)
        p_s = sigmoid(Linear(a_s))                    # attractor existence probability
        if p_s < threshold and s > 1:                 # stop decoding, but keep at least one attractor
            break
        attractors.append(a_s)
    attractors = stack(attractors)                    # (num_decoded, d)

    # Cross-block consistency: never reduce the speaker count, align and smooth attractors.
    S_b = max(prev_S, len(attractors))
    attractors = reorder_to_previous(attractors, prev_attractors)  # cosine-similarity reordering
    if prev_attractors is not None:
        attractors = 0.5 * (attractors + prev_attractors)          # average aligned attractors
    Yhat_b = sigmoid(E_b @ transpose(attractors[:S_b]))            # per-frame speaker activities

    prev_attractors, prev_S = attractors, S_b
```
Heuristics for attractor-to-speaker assignment across blocks (cosine-similarity reordering, attractor averaging, and speaker-order shuffling during training) are critical to prevent catastrophic error-rate increases.
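One such heuristic, greedy cosine-similarity matching of new attractors to the previous block's attractors followed by averaging, might look like the following sketch (the function and its details are illustrative assumptions, not the paper's exact procedure):

```python
import torch
import torch.nn.functional as F

def match_and_smooth(attractors, prev_attractors):
    """Greedily align new attractors to the previous block's by cosine similarity, then average.
    attractors: (S_new, d); prev_attractors: (S_prev, d) or None for the first block."""
    if prev_attractors is None:
        return attractors
    # Pairwise cosine similarity between previous and new attractors: (S_prev, S_new).
    sim = F.cosine_similarity(prev_attractors.unsqueeze(1), attractors.unsqueeze(0), dim=-1)
    order, used = [], set()
    for i in range(prev_attractors.size(0)):          # keep previously discovered speakers first
        j = max((j for j in range(attractors.size(0)) if j not in used),
                key=lambda j: sim[i, j].item(), default=None)
        if j is None:
            break
        order.append(j)
        used.add(j)
    order += [j for j in range(attractors.size(0)) if j not in used]   # append newly found speakers
    reordered = attractors[order]
    n = min(prev_attractors.size(0), reordered.size(0))
    smoothed = reordered.clone()
    smoothed[:n] = 0.5 * (reordered[:n] + prev_attractors[:n])         # average with previous block
    return smoothed
```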
Trade-offs include the accuracy-latency relationship (UL mode is more accurate but has higher latency, while LL mode offers near-realtime output at the cost of a 2–3% absolute DER increase) and the context-complexity relationship (a larger left context $L$ yields better accuracy but higher per-block computation and memory).
Open challenges include long-session attractor consistency (“speaker ID drift”), improved realistic simulation (e.g., moving to LibriCSS-style mixtures), extension to far-field data, reducing latency below the block level, and developing end-to-end learned heuristics for attractor management and attention parameterization.
BW-EDA-EEND establishes a practical framework for streaming neural diarization with competitive accuracy and algorithmic scalability on variable numbers of speakers (Han et al., 2020).