End-to-End Neural Diarization (EEND)
- EEND is a neural diarization framework that directly classifies frame-level speaker activity from acoustic features, effectively handling overlapped speech.
- It employs deep architectures such as BLSTM, Transformer, and Conformer with permutation-invariant training to optimize diarization error rates.
- Variants like LS-EEND and AED-EEND enable real-time, streaming diarization with dynamic speaker allocation and state-of-the-art performance on benchmarks.
End-to-End Neural Diarization (EEND) is a paradigm for speaker diarization that reformulates the task as direct, frame-level multi-label classification using deep neural networks. EEND approaches have demonstrated substantial superiority over clustering-based pipelines, particularly in the presence of overlapped speech, and enable the integration of advanced deep learning architectures and end-to-end optimization of Diarization Error Rate (DER) (Fujita et al., 2020). The following sections provide a comprehensive overview of EEND and its major contemporary variants, including LS-EEND, covering architectural principles, streaming and long-form diarization, training methodologies, empirical performance, and system-level implications.
1. Core Principles and Canonical EEND Framework
The foundation of EEND is the direct mapping from input acoustic features, typically log-Mel filterbanks with context stacking, to frame-wise speaker activity posteriors ŷ_{t,s} for a set of S candidate speakers s = 1, …, S at every frame t. The network structure consists of:
- A deep encoder, originally BLSTM or Transformer-based, that produces a frame-level embedding e_t for each frame t.
- A multi-label, permutation-invariant training (PIT) objective, which considers all permutations of the output streams and minimizes binary cross-entropy with respect to the optimal reference ordering for each utterance, thus resolving the “label permutation” ambiguity:

  L = (1 / TS) · min over permutations φ of Σ_t BCE(l_t^φ, ŷ_t),

  where l_t^φ is the reference label vector at frame t under speaker permutation φ.
The output is interpreted as frame-level multi-label speaker activity, naturally handling speech overlaps—i.e., multiple concurrent active speakers (Fujita et al., 2020).
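The permutation-invariant objective described above can be sketched concretely. The following NumPy snippet is illustrative only (function names such as `pit_bce` are ours, not from the cited papers): it evaluates binary cross-entropy under every permutation of the output streams and keeps the minimum.

```python
import numpy as np
from itertools import permutations

def bce(labels, probs, eps=1e-7):
    """Frame-wise binary cross-entropy, summed over frames and speakers."""
    probs = np.clip(probs, eps, 1 - eps)
    return -np.sum(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

def pit_bce(labels, probs):
    """Minimum BCE over all permutations of the S output streams.

    labels: (T, S) binary reference speaker activities
    probs:  (T, S) predicted frame-level posteriors
    """
    S = labels.shape[1]
    return min(bce(labels[:, list(perm)], probs) for perm in permutations(range(S)))

# Toy example: the prediction matches the reference up to a stream swap.
ref = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)
est = np.array([[0.1, 0.9], [0.1, 0.9], [0.9, 0.1]])  # streams swapped
print(pit_bce(ref, est) < bce(ref, est))  # permutation search recovers the match
```

Because the minimum is taken over S! permutations, exhaustive search is only practical for small speaker counts; efficient assignment algorithms are used in practice.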
2. Architectural Evolution and Attractor Mechanisms
A central challenge in EEND is supporting variable and flexible numbers of speakers. The Encoder-Decoder Attractor (EDA) module enables this by producing a variable set of “attractor” vectors via a recurrent sequence-to-sequence architecture (LSTM or Transformer variant), conditioned on the encoder output (Horiguchi et al., 2021, Samarakoon et al., 2023). For each attractor a_s, the model computes the frame-level posterior ŷ_{t,s} = σ(e_t · a_s), the sigmoid of the dot product between frame embedding and attractor. A speaker-existence mechanism (sigmoid-activated scalar per attractor) is used to dynamically select the number of speakers during inference.
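An EDA-style decoding step can be sketched as follows (a simplified illustration with our own naming, not the papers' code): each attractor carries an existence logit, attractors below the existence threshold are discarded, and posteriors are the sigmoid of frame-embedding/attractor dot products.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attractor_posteriors(embeddings, attractors, exist_logits, threshold=0.5):
    """EDA-style decoding sketch.

    embeddings:   (T, D) frame-level encoder outputs
    attractors:   (A, D) candidate attractor vectors
    exist_logits: (A,)   speaker-existence logits, one per attractor
    Returns frame-level posteriors for the attractors judged to exist.
    """
    exists = sigmoid(exist_logits) > threshold   # dynamic speaker count
    active = attractors[exists]                  # keep valid attractors only
    return sigmoid(embeddings @ active.T)        # y[t, s] = sigma(e_t . a_s)

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))
att = rng.normal(size=(3, 8))
post = attractor_posteriors(emb, att, np.array([4.0, 3.0, -4.0]))
print(post.shape)  # third attractor rejected -> (5, 2)
```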
Notable architectural variants and enhancements include:
- Transformer and Conformer Encoders: Employ multi-head self-attention (Transformer), or hybrid convolutional-attention blocks (Conformer) to capture both long-range speaker relations and local context. Conformer-based EEND further reduces DER, especially when conversational statistics match target data (Liu et al., 2021, Liang et al., 9 Oct 2024).
- Attention-based Decoder Replacements: Transformer-based attractor decoders (as in AED-EEND, EEND-TA) improve both convergence and efficiency compared to LSTM-based EDA (Chen et al., 2023, Samarakoon et al., 2023).
- Embedding Enhancement: Embedding enhancer modules (cross-attention refinements post-attractor extraction) further improve discrimination and robustness to unseen speaker numbers (Chen et al., 2023).
3. Streaming and Long-Form Diarization (LS-EEND and Related)
Standard self-attention incurs quadratic complexity in sequence length, limiting conventional EEND to short- or medium-length recordings. LS-EEND achieves true online, frame-synchronous diarization with linear temporal complexity, enabling diarization of hour-long recordings in streaming mode (Liang et al., 9 Oct 2024).
- Causal Conformer Encoder: Each block uses multi-head "retention" (a recurrence-compatible attention surrogate), causal convolutions (preserving causality), and L2-normalization of embeddings.
- Online Attractor Decoder: Maintains per-speaker attractors a_s, updated at every frame using along-time retention (O(1) cost per frame), and cross-attractor self-attention in the speaker dimension for enhanced speaker separation.
- Retention Mechanism: Retention, replacing self-attention softmax with an exponential decay mask, allows efficient accumulation of past context in both training (parallel recurrent updates) and inference (sequential recurrence).
- Frame-in-Frame-out Processing: Every input frame directly updates all speaker attractors and yields a diarization prediction with minimal delay (≤1 s latency with lookahead).
- Progressive Training: Multi-stage curriculum over increasing speaker count and audio length, chunk-wise retention during long-form adaptation, and output-anchored losses for direct scale-up.
- Performance: LS-EEND achieves new state-of-the-art online DER on CALLHOME (12.11%), DIHARD II (27.58%), DIHARD III (19.61%), and AMI (20.76%) with linear time complexity (RTF ≈ 0.028), outperforming buffer-based and block-wise streaming systems by 3–7 DER points while reducing computational cost by an order of magnitude (Liang et al., 9 Oct 2024).
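The retention mechanism underpinning this efficiency admits two mathematically equivalent forms: a sequential recurrence with a fixed-size state for inference, and a parallel decay-masked attention for training. The sketch below (our illustration, not the LS-EEND implementation; single head, no normalization) verifies the equivalence numerically.

```python
import numpy as np

def retention_recurrent(q, k, v, gamma):
    """Sequential (inference-time) retention: O(1) state update per frame."""
    T, d = q.shape
    state = np.zeros((d, v.shape[1]))
    out = np.zeros((T, v.shape[1]))
    for t in range(T):
        state = gamma * state + np.outer(k[t], v[t])  # accumulate decayed context
        out[t] = q[t] @ state
    return out

def retention_parallel(q, k, v, gamma):
    """Parallel (training-time) form: attention with an exponential decay mask."""
    T = q.shape[0]
    idx = np.arange(T)
    decay = np.where(idx[:, None] >= idx[None, :],
                     gamma ** (idx[:, None] - idx[None, :]), 0.0)  # causal decay mask
    return (decay * (q @ k.T)) @ v

rng = np.random.default_rng(1)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
print(np.allclose(retention_recurrent(q, k, v, 0.9),
                  retention_parallel(q, k, v, 0.9)))  # both forms agree
```

The recurrent form is what makes frame-in-frame-out streaming possible: the entire past is summarized in a fixed-size state matrix rather than a growing key/value cache.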
4. Training Methodologies and Data Simulation
EEND training is characterized by large-scale simulation, domain adaptation, and recent advances in "teacher forcing" and enhanced data synthesis:
- Data Simulation: Both simulated mixtures (random pause distributions) and more realistic simulated conversations (empirical pause/overlap statistics) are used. The latter better emulates target data conversational structure, reducing reliance on fine-tuning and improving generalization (Landini et al., 2022).
- Curriculum Learning: Progressive increase in the number of speakers and utterance length, as in LS-EEND's staged training (Liang et al., 9 Oct 2024).
- Loss Functions: PIT binary cross-entropy for diarization, auxiliary losses (e.g., embedding similarity, intermediate attractor losses), and cross-entropy over auxiliary subtasks (speech activity detection, overlap detection) in multitask setups (Chen et al., 2023, Takashima et al., 2021).
- Pseudo-label and Semi-supervised Training: Iterative pseudo-labeling and committee-based fusion allow effective adaptation to unlabeled domains, yielding up to 37.4% relative DER reduction without ground-truth frame annotations (Takashima et al., 2021).
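A minimal version of the simulation idea can be sketched as follows. This is a deliberate simplification of the recipes in the cited work (which fit pause/overlap statistics to real conversations); here each speaker independently alternates exponentially distributed pauses and utterances, so overlap arises whenever active spans coincide. All names and parameter values are illustrative.

```python
import numpy as np

def simulate_labels(n_speakers, n_frames, mean_utt=50, mean_pause=30, seed=0):
    """Toy simulated-mixture label generator.

    Returns a (n_frames, n_speakers) binary frame-level activity matrix.
    """
    rng = np.random.default_rng(seed)
    labels = np.zeros((n_frames, n_speakers), dtype=int)
    for s in range(n_speakers):
        t = 0
        while t < n_frames:
            t += int(rng.exponential(mean_pause)) + 1   # silence gap
            dur = int(rng.exponential(mean_utt)) + 1    # utterance length
            labels[t:t + dur, s] = 1                    # mark active frames
            t += dur
    return labels

lab = simulate_labels(2, 1000)
print(lab.shape)
```

Matching the pause and overlap distributions to the target domain, rather than using arbitrary values as here, is precisely what reduces the simulation-to-real mismatch reported by Landini et al. (2022).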
5. Empirical Performance and Benchmarking
EEND and its derivatives establish state-of-the-art results across diarization benchmarks. Representative best accuracies as reported:
| System | CALLHOME | DIHARD II | DIHARD III | AMI | RTF |
|---|---|---|---|---|---|
| LS-EEND (Liang et al., 9 Oct 2024) | 12.11% | 27.58% | 19.61% | 20.76% | 0.028 |
| AED-EEND + Enh/Conformer (Chen et al., 2023) | 10.08% | 24.64% | 13.00% | — | — |
LS-EEND and AED-EEND variants, without oracle SAD, robustly outperform prior online and offline systems, including those based on offline EEND-EDA, FLEX-STB, and buffer-based inference (Liang et al., 9 Oct 2024, Chen et al., 2023).
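The DER values in the table combine missed speech, false alarms, and speaker confusion, scored against an optimal speaker mapping. A frame-level sketch of the metric (a simplification of collar-based scoring; names are ours) makes the definition concrete:

```python
import numpy as np
from itertools import permutations

def frame_der(ref, hyp):
    """Frame-level DER sketch: (miss + false alarm + confusion) divided by
    total reference speaker time, minimized over speaker permutations.

    ref, hyp: (T, S) binary speaker-activity matrices.
    """
    S = ref.shape[1]
    total = ref.sum()  # total reference speaker-frames
    errs = []
    for perm in permutations(range(S)):
        h = hyp[:, list(perm)]
        correct = (ref * h).sum(axis=1)               # correctly attributed speakers
        errs.append((np.maximum(ref.sum(1), h.sum(1)) - correct).sum())
    return min(errs) / total

ref = np.array([[1, 0], [1, 0], [0, 1], [0, 0]])
hyp = np.array([[0, 1], [0, 1], [1, 0], [0, 0]])  # perfect up to a stream swap
print(frame_der(ref, hyp))  # -> 0.0
```

Real scoring tools additionally apply forgiveness collars around reference boundaries and operate on time segments rather than frames, but the error decomposition is the same.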
6. Extensions: Multi-Channel, Unlimited-Speaker, and System Calibration
- Multi-Channel Diarization: EEND extends naturally to distributed microphone settings via spatio-temporal and co-attention encoder variants, leveraging spatial diversity for superior DER, even in asynchronous or spatially ambiguous conditions (Horiguchi et al., 2021).
- Unlimited-number-of-Speakers: Offline and online block-wise inference coupled with clustering over local attractors allows EEND-GLA to handle recordings with more speakers than seen during training, by relaxing the cap on output tracks via post-hoc clustering (Horiguchi et al., 2022).
- Calibration and Fusion: Probability-level calibration (joint or per-speaker), probability-space fusion, and "Fuse-then-Calibrate" schemes significantly reduce DER and enable use of EEND outputs in risk-aware diarization or system combination. Soft probability fusion outperforms hard-segmentation methods such as DOVER-Lap (Alvarez-Trejos et al., 27 Nov 2025).
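The soft-fusion idea can be sketched in a few lines. This is our simplified illustration of probability-space fusion, not the cited authors' method: each system's speaker streams are first permuted to best match an anchor system, posteriors are averaged, and the result is thresholded (the calibration step is omitted for brevity).

```python
import numpy as np
from itertools import permutations

def align_to(anchor, probs):
    """Permute speaker streams of `probs` to best match `anchor` (L1 distance)."""
    S = probs.shape[1]
    best = min(permutations(range(S)),
               key=lambda p: np.abs(anchor - probs[:, list(p)]).sum())
    return probs[:, list(best)]

def fuse(systems, threshold=0.5):
    """Soft fusion sketch: align streams to the first system, average
    frame-level posteriors, then threshold into binary decisions."""
    anchor = systems[0]
    aligned = [anchor] + [align_to(anchor, s) for s in systems[1:]]
    return (np.mean(aligned, axis=0) > threshold).astype(int)

a = np.array([[0.9, 0.1], [0.8, 0.2]])
b = np.array([[0.2, 0.7], [0.3, 0.6]])  # streams swapped relative to `a`
print(fuse([a, b]))  # stream alignment resolves the swap before averaging
```

Averaging in probability space is what distinguishes this from hard-segmentation combination such as DOVER-Lap: disagreements between systems are resolved by confidence rather than by majority vote over discrete labels.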
7. Methodological Implications and Future Directions
EEND and LS-EEND unify embedding extraction, attractor generation, and diarization into a single, end-to-end optimized, causal network, enabling deployment in real-time, low-latency scenarios such as meeting transcription, conferencing, and streaming ASR front-ends. While current architectures cap maximum speakers by decoder dimension or block design, ongoing advances in unsupervised attractor clustering, dynamic attractor selection, semi-supervised adaptation, and multimodal conditioning continue to extend EEND's flexibility and accuracy envelope (Liang et al., 9 Oct 2024, Alvarez-Trejos et al., 27 Nov 2025, Horiguchi et al., 2022).
Recent results confirm that self-attention and retention-based architectures, progressive and adversarial training, probability calibration, and multi-task learning (including ASR feature conditioning) all contribute significantly to state-of-the-art diarization performance, robust to long-session, multi-speaker, and overlapping-speech conditions.