End-to-End Neural Diarization (EEND)
- EEND is a neural diarization framework that directly classifies frame-level speaker activity from acoustic features, effectively handling overlapped speech.
- It employs deep architectures such as BLSTM, Transformer, and Conformer with permutation-invariant training to optimize diarization error rates.
- Variants like LS-EEND and AED-EEND enable real-time, streaming diarization with dynamic speaker allocation and state-of-the-art performance on benchmarks.
End-to-End Neural Diarization (EEND) is a paradigm for speaker diarization that reformulates the task as direct, frame-level multi-label classification using deep neural networks. EEND approaches have demonstrated substantial superiority over clustering-based pipelines, particularly in the presence of overlapped speech, and enable the integration of advanced deep learning architectures and end-to-end optimization of Diarization Error Rate (DER) (Fujita et al., 2020). The following sections provide a comprehensive overview of EEND and its major contemporary variants, including LS-EEND, covering architectural principles, streaming and long-form diarization, training methodologies, empirical performance, and system-level implications.
1. Core Principles and Canonical EEND Framework
The foundation of EEND is the direct mapping from input acoustic features, typically log-Mel filterbanks with context stacking, to frame-wise speaker activity posteriors ŷ_{t,s} for a set of S candidate speakers s = 1, …, S at every frame t. The network structure consists of:
- A deep encoder, originally BLSTM or Transformer-based, that produces a frame-level embedding e_t for each frame t.
- A multi-label, permutation-invariant training (PIT) objective, which considers all permutations of the output streams and minimizes binary cross-entropy with respect to the optimal reference ordering for each utterance, thus resolving the “label permutation” ambiguity:

  L = (1 / TS) · min over permutations φ of Σ_t BCE(l_t^φ, ŷ_t),

  where l_t^φ is the reference label vector at frame t under speaker permutation φ.
The output is interpreted as frame-level multi-label speaker activity, naturally handling speech overlaps—i.e., multiple concurrent active speakers (Fujita et al., 2020).
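The permutation-invariant objective described above can be sketched concretely. The following NumPy snippet is illustrative only (function names such as `pit_bce` are ours, not from the cited papers): it evaluates binary cross-entropy under every permutation of the output streams and keeps the minimum.

```python
import numpy as np
from itertools import permutations

def bce(labels, probs, eps=1e-7):
    """Frame-wise binary cross-entropy, summed over frames and speakers."""
    probs = np.clip(probs, eps, 1 - eps)
    return -np.sum(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

def pit_bce(labels, probs):
    """Minimum BCE over all permutations of the S output streams.

    labels: (T, S) binary reference speaker activities
    probs:  (T, S) predicted frame-level posteriors
    """
    S = labels.shape[1]
    return min(bce(labels[:, list(perm)], probs) for perm in permutations(range(S)))

# Toy example: the prediction matches the reference up to a stream swap.
ref = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)
est = np.array([[0.1, 0.9], [0.1, 0.9], [0.9, 0.1]])  # streams swapped
print(pit_bce(ref, est) < bce(ref, est))  # permutation search recovers the match
```

Because the minimum is taken over S! permutations, exhaustive search is only practical for small speaker counts; efficient assignment algorithms are used in practice.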
2. Architectural Evolution and Attractor Mechanisms
A central challenge in EEND is supporting variable and flexible numbers of speakers. The Encoder-Decoder Attractor (EDA) module enables this by producing a variable set of “attractor” vectors via a recurrent sequence-to-sequence architecture (LSTM or Transformer variant), conditioned on the encoder output (Horiguchi et al., 2021, Samarakoon et al., 2023). For each attractor a_s, the model computes the frame-level posterior ŷ_{t,s} = σ(e_t · a_s), the sigmoid of the dot product between frame embedding and attractor. A speaker-existence mechanism (sigmoid-activated scalar per attractor) is used to dynamically select the number of speakers during inference.
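An EDA-style decoding step can be sketched as follows (a simplified illustration with our own naming, not the papers' code): each attractor carries an existence logit, attractors below the existence threshold are discarded, and posteriors are the sigmoid of frame-embedding/attractor dot products.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attractor_posteriors(embeddings, attractors, exist_logits, threshold=0.5):
    """EDA-style decoding sketch.

    embeddings:   (T, D) frame-level encoder outputs
    attractors:   (A, D) candidate attractor vectors
    exist_logits: (A,)   speaker-existence logits, one per attractor
    Returns frame-level posteriors for the attractors judged to exist.
    """
    exists = sigmoid(exist_logits) > threshold   # dynamic speaker count
    active = attractors[exists]                  # keep valid attractors only
    return sigmoid(embeddings @ active.T)        # y[t, s] = sigma(e_t . a_s)

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))
att = rng.normal(size=(3, 8))
post = attractor_posteriors(emb, att, np.array([4.0, 3.0, -4.0]))
print(post.shape)  # third attractor rejected -> (5, 2)
```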
Notable architectural variants and enhancements include:
- Transformer and Conformer Encoders: Employ multi-head self-attention (Transformer), or hybrid convolutional-attention blocks (Conformer) to capture both long-range speaker relations and local context. Conformer-based EEND further reduces DER, especially when conversational statistics match target data (Liu et al., 2021, Liang et al., 9 Oct 2024).
- Attention-based Decoder Replacements: Transformer-based attractor decoders (as in AED-EEND, EEND-TA) improve both convergence and efficiency compared to LSTM-based EDA (Chen et al., 2023, Samarakoon et al., 2023).
- Embedding Enhancement: Embedding enhancer modules (cross-attention refinements post-attractor extraction) further improve discrimination and robustness to unseen speaker numbers (Chen et al., 2023).
3. Streaming and Long-Form Diarization (LS-EEND and Related)
Standard self-attention incurs quadratic complexity in sequence length, limiting conventional EEND to short- or medium-length recordings. LS-EEND achieves true online, frame-synchronous diarization with linear temporal complexity, enabling diarization of hour-long recordings in streaming mode (Liang et al., 9 Oct 2024).
- Causal Conformer Encoder: Each block uses multi-head "retention" (a recurrence-compatible attention surrogate), causal convolutions (preserving causality), and L2-normalization of embeddings.
- Online Attractor Decoder: Maintains per-speaker attractors a_s, updated at every frame using along-time retention (O(1) cost per frame), and cross-attractor self-attention in the speaker dimension for enhanced speaker separation.
- Retention Mechanism: Retention, replacing self-attention softmax with an exponential decay mask, allows efficient accumulation of past context in both training (parallel recurrent updates) and inference (sequential recurrence).
- Frame-in-Frame-out Processing: Every input frame directly updates all speaker attractors and yields a diarization prediction with minimal delay (≤1 s latency with lookahead).
- Progressive Training: Multi-stage curriculum over increasing speaker count and audio length, chunk-wise retention during long-form adaptation, and output-anchored losses for direct scale-up.
- Performance: LS-EEND achieves new state-of-the-art online DER on CALLHOME (12.11%), DIHARD II (27.58%), DIHARD III (19.61%), and AMI (20.76%) with linear time complexity (RTF ≈ 0.028), outperforming buffer-based and block-wise streaming systems by 3–7 DER points while reducing computational cost by an order of magnitude (Liang et al., 9 Oct 2024).
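The retention mechanism underpinning this efficiency admits two mathematically equivalent forms: a sequential recurrence with a fixed-size state for inference, and a parallel decay-masked attention for training. The sketch below (our illustration, not the LS-EEND implementation; single head, no normalization) verifies the equivalence numerically.

```python
import numpy as np

def retention_recurrent(q, k, v, gamma):
    """Sequential (inference-time) retention: O(1) state update per frame."""
    T, d = q.shape
    state = np.zeros((d, v.shape[1]))
    out = np.zeros((T, v.shape[1]))
    for t in range(T):
        state = gamma * state + np.outer(k[t], v[t])  # accumulate decayed context
        out[t] = q[t] @ state
    return out

def retention_parallel(q, k, v, gamma):
    """Parallel (training-time) form: attention with an exponential decay mask."""
    T = q.shape[0]
    idx = np.arange(T)
    decay = np.where(idx[:, None] >= idx[None, :],
                     gamma ** (idx[:, None] - idx[None, :]), 0.0)  # causal decay mask
    return (decay * (q @ k.T)) @ v

rng = np.random.default_rng(1)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
print(np.allclose(retention_recurrent(q, k, v, 0.9),
                  retention_parallel(q, k, v, 0.9)))  # both forms agree
```

The recurrent form is what makes frame-in-frame-out streaming possible: the entire past is summarized in a fixed-size state matrix rather than a growing key/value cache.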
4. Training Methodologies and Data Simulation
EEND training is characterized by large-scale simulation, domain adaptation, and recent advances in "teacher forcing" and enhanced data synthesis:
- Data Simulation: Both simulated mixtures (random pause distributions) and more realistic simulated conversations (empirical pause/overlap statistics) are used. The latter better emulates target data conversational structure, reducing reliance on fine-tuning and improving generalization (Landini et al., 2022).
- Curriculum Learning: Progressive increase in the number of speakers and utterance length, as in LS-EEND's staged training (Liang et al., 9 Oct 2024).
- Loss Functions: PIT binary cross-entropy for diarization, auxiliary losses (e.g., embedding similarity, intermediate attractor losses), and cross-entropy over auxiliary subtasks (speech activity detection, overlap detection) in multitask setups (Chen et al., 2023, Takashima et al., 2021).
- Pseudo-label and Semi-supervised Training: Iterative pseudo-labeling and committee-based fusion allow effective adaptation to unlabeled domains, yielding up to 37.4% relative DER reduction without ground-truth frame annotations (Takashima et al., 2021).
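A minimal version of the simulation idea can be sketched as follows. This is a deliberate simplification of the recipes in the cited work (which fit pause/overlap statistics to real conversations); here each speaker independently alternates exponentially distributed pauses and utterances, so overlap arises whenever active spans coincide. All names and parameter values are illustrative.

```python
import numpy as np

def simulate_labels(n_speakers, n_frames, mean_utt=50, mean_pause=30, seed=0):
    """Toy simulated-mixture label generator.

    Returns a (n_frames, n_speakers) binary frame-level activity matrix.
    """
    rng = np.random.default_rng(seed)
    labels = np.zeros((n_frames, n_speakers), dtype=int)
    for s in range(n_speakers):
        t = 0
        while t < n_frames:
            t += int(rng.exponential(mean_pause)) + 1   # silence gap
            dur = int(rng.exponential(mean_utt)) + 1    # utterance length
            labels[t:t + dur, s] = 1                    # mark active frames
            t += dur
    return labels

lab = simulate_labels(2, 1000)
print(lab.shape)
```

Matching the pause and overlap distributions to the target domain, rather than using arbitrary values as here, is precisely what reduces the simulation-to-real mismatch reported by Landini et al. (2022).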
5. Empirical Performance and Benchmarking
EEND and its derivatives establish state-of-the-art results across diarization benchmarks. Representative best accuracies as reported:
| System | CALLHOME | DIHARD II | DIHARD III | AMI | RTF |
|---|---|---|---|---|---|
| LS-EEND (Liang et al., 9 Oct 2024) | 12.11% | 27.58% | 19.61% | 20.76% | 0.028 |
| AED-EEND + Enh/Conformer (Chen et al., 2023) | 10.08% | 24.64% | 13.00% | — | — |
LS-EEND and AED-EEND variants, without oracle SAD, robustly outperform prior online and offline systems, including those based on offline EEND-EDA, FLEX-STB, and buffer-based inference (Liang et al., 9 Oct 2024, Chen et al., 2023).
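The DER values in the table combine missed speech, false alarms, and speaker confusion, scored against an optimal speaker mapping. A frame-level sketch of the metric (a simplification of collar-based scoring; names are ours) makes the definition concrete:

```python
import numpy as np
from itertools import permutations

def frame_der(ref, hyp):
    """Frame-level DER sketch: (miss + false alarm + confusion) divided by
    total reference speaker time, minimized over speaker permutations.

    ref, hyp: (T, S) binary speaker-activity matrices.
    """
    S = ref.shape[1]
    total = ref.sum()  # total reference speaker-frames
    errs = []
    for perm in permutations(range(S)):
        h = hyp[:, list(perm)]
        correct = (ref * h).sum(axis=1)               # correctly attributed speakers
        errs.append((np.maximum(ref.sum(1), h.sum(1)) - correct).sum())
    return min(errs) / total

ref = np.array([[1, 0], [1, 0], [0, 1], [0, 0]])
hyp = np.array([[0, 1], [0, 1], [1, 0], [0, 0]])  # perfect up to a stream swap
print(frame_der(ref, hyp))  # -> 0.0
```

Real scoring tools additionally apply forgiveness collars around reference boundaries and operate on time segments rather than frames, but the error decomposition is the same.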
6. Extensions: Multi-Channel, Unlimited-Speaker, and System Calibration
- Multi-Channel Diarization: EEND extends naturally to distributed microphone settings via spatio-temporal and co-attention encoder variants, leveraging spatial diversity for superior DER, even in asynchronous or spatially ambiguous conditions (Horiguchi et al., 2021).
- Unlimited-number-of-Speakers: Offline and online block-wise inference coupled with clustering over local attractors allows EEND-GLA to handle recordings with more speakers than seen during training, by relaxing the cap on output tracks via post-hoc clustering (Horiguchi et al., 2022).
- Calibration and Fusion: Probability-level calibration (joint or per-speaker), probability-space fusion, and "Fuse-then-Calibrate" schemes significantly reduce DER and enable use of EEND outputs in risk-aware diarization or system combination. Soft probability fusion outperforms hard-segmentation methods such as DOVER-Lap (Alvarez-Trejos et al., 27 Nov 2025).
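The soft-fusion idea can be sketched in a few lines. This is our simplified illustration of probability-space fusion, not the cited authors' method: each system's speaker streams are first permuted to best match an anchor system, posteriors are averaged, and the result is thresholded (the calibration step is omitted for brevity).

```python
import numpy as np
from itertools import permutations

def align_to(anchor, probs):
    """Permute speaker streams of `probs` to best match `anchor` (L1 distance)."""
    S = probs.shape[1]
    best = min(permutations(range(S)),
               key=lambda p: np.abs(anchor - probs[:, list(p)]).sum())
    return probs[:, list(best)]

def fuse(systems, threshold=0.5):
    """Soft fusion sketch: align streams to the first system, average
    frame-level posteriors, then threshold into binary decisions."""
    anchor = systems[0]
    aligned = [anchor] + [align_to(anchor, s) for s in systems[1:]]
    return (np.mean(aligned, axis=0) > threshold).astype(int)

a = np.array([[0.9, 0.1], [0.8, 0.2]])
b = np.array([[0.2, 0.7], [0.3, 0.6]])  # streams swapped relative to `a`
print(fuse([a, b]))  # stream alignment resolves the swap before averaging
```

Averaging in probability space is what distinguishes this from hard-segmentation combination such as DOVER-Lap: disagreements between systems are resolved by confidence rather than by majority vote over discrete labels.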
7. Methodological Implications and Future Directions
EEND and LS-EEND unify embedding extraction, attractor generation, and diarization into a single, end-to-end optimized, causal network, enabling deployment in real-time, low-latency scenarios such as meeting transcription, conferencing, and streaming ASR front-ends. While current architectures cap maximum speakers by decoder dimension or block design, ongoing advances in unsupervised attractor clustering, dynamic attractor selection, semi-supervised adaptation, and multimodal conditioning continue to extend EEND's flexibility and accuracy envelope (Liang et al., 9 Oct 2024, Alvarez-Trejos et al., 27 Nov 2025, Horiguchi et al., 2022).
Recent results confirm that self-attention and retention-based architectures, progressive and adversarial training, probability calibration, and multi-task learning (including ASR feature conditioning) all contribute significantly to state-of-the-art diarization performance, robust to long-session, multi-speaker, and overlapping-speech conditions.