End-to-End Neural Speaker Diarization

Updated 25 March 2026

The paper introduces an end-to-end framework that replaces modular pipelines by jointly learning voice activity, overlap handling, and speaker segmentation.
It leverages permutation-invariant training with self-attention and Conformer architectures to model both local and global speaker characteristics.
The approach improves diarization accuracy in complex, multi-speaker environments, achieving significant DER reductions over traditional clustering methods.

End-to-end neural speaker diarization (EEND) denotes a class of models that directly map multi-speaker audio sequences to frame-level speaker activities, eschewing the traditional modular cascade of speech activity detection, feature embedding extraction, and unsupervised clustering. EEND architectures leverage deep neural networks—especially self-attention, convolutional, and/or encoder–decoder designs—to jointly solve voice activity detection (VAD), overlapped speech handling, and speaker segmentation in a supervised, fully differentiable manner. This paradigm yields substantial accuracy gains for conversational and meeting-style speech, particularly with respect to overlapped segments, and has become the dominant approach for neural diarization research.

1. Formulation and Core Architectures

The EEND framework adopts a multi-label classification approach: given an acoustic feature sequence $X = [x_1, ..., x_T]$ (where $x_t \in \mathbb{R}^F$), the network predicts, at each frame, a binary activity vector $y_t = [y_{t,1}, ..., y_{t,S}]$, where $S$ is the number of output speaker tracks and $y_{t,s}=1$ means speaker $s$ is active at time $t$ (Fujita et al., 2019, Fujita et al., 2020). Unlike modular pipelines, this formulation allows the model to handle overlapped speech directly, as $y_t$ can have multiple nonzero entries.

Permutation-invariant training (PIT) is critical, as the speaker order in the output activities is arbitrary. The objective minimizes binary cross-entropy (BCE) over all permutations of output–label assignments (PIT loss): $J_{\rm PIT} = \frac{1}{TS}\min_{\phi \in \text{Perm}(S)}\sum_{t=1}^T \text{BCE}(y^\phi_t, z_t)$, where $z_t$ are the output posteriors, $\phi$ is a permutation of the $S$ speaker indices, and $y^\phi_t$ is the reference label vector $y_t$ reordered by $\phi$ (Fujita et al., 2019, Fujita et al., 2020).
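The permutation search in the PIT loss can be illustrated with a minimal pure-Python sketch. It is not taken from any cited implementation: the function names `bce` and `pit_bce_loss`, the clipping constant `eps`, and the toy inputs are assumptions made for illustration. The loss is normalized by the number of frames times the number of speaker tracks, as in the formula above.

```python
import itertools
import math


def bce(label, posterior, eps=1e-7):
    """Binary cross-entropy for one frame/speaker pair (posterior clipped for stability)."""
    p = min(max(posterior, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def pit_bce_loss(labels, posteriors):
    """Return (loss, best_perm) for T x S reference labels and output posteriors.

    best_perm[s] is the reference speaker that output track s is matched to.
    """
    num_frames = len(labels)
    num_tracks = len(labels[0])
    best_loss, best_perm = float("inf"), None
    # Exhaustive search over all S! assignments; practical only for small S.
    for perm in itertools.permutations(range(num_tracks)):
        total = 0.0
        for t in range(num_frames):
            for s in range(num_tracks):
                total += bce(labels[t][perm[s]], posteriors[t][s])
        if total < best_loss:
            best_loss, best_perm = total, perm
    return best_loss / (num_frames * num_tracks), best_perm


# Toy example (hypothetical values): the network emits the two speakers in swapped order.
labels = [[1, 0], [1, 1], [0, 1], [0, 0]]
posteriors = [[0.1, 0.9], [0.8, 0.9], [0.9, 0.2], [0.1, 0.1]]
loss, perm = pit_bce_loss(labels, posteriors)
print(perm, round(loss, 4))
```

Because the search enumerates every permutation, its cost grows factorially with the number of tracks. This is acceptable for the two- to four-speaker settings typical of fixed-output EEND, but it is one reason larger speaker counts call for other mechanisms.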

Self-attention-based architectures (SA-EEND): Stacking Transformer encoders has been shown to outperform BLSTM backbones by better modeling both local voice activity and global speaker characteristics through attention. In these models, each output head predicts activity for one hypothetical speaker (Fujita et al., 2020, Liu et al., 2021).

Conformer-based EEND replaces the Transformer encoder with Conformer blocks, integrating convolutional and self-attention mechanisms. This affords simultaneous modeling of both local temporal cues (for quick speaker changes) and long-range dependencies (for global speaker identity) (Liu et al., 2021).

2. Flexible Speaker-Count and Overlap Handling

Vanilla EEND is limited by a fixed number of output speaker tracks $S$, chosen before training. To support an unknown or variable speaker count, which is critical for multi-party real-world scenarios, multiple architectural modifications have been proposed:

Encoder–Decoder Attractor (EDA): EEND-EDA leverages an LSTM sequence-to-sequence module that generates "speaker attractors" adaptively. Given the sequence of encoder embeddings, the decoder produces speaker attractor vectors $a_1, ..., a_S$ and their existence probabilities, up to a (potentially variable) number of speakers (Horiguchi et al., 2021, Landini, 2024). During inference, all speaker attractors with existence probability greater than a threshold are selected, sidestepping the need to specify $S$ in advance. Framewise activity for each speaker is computed as

$z_{t,s} = \sigma(e_t^\top a_s)$

where $e_t$ is the time-$t$ encoding, $a_s$ is the attractor for speaker $s$, and $\sigma$ is the logistic sigmoid.
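The attractor-based scoring can be sketched in a few lines of pure Python, assuming the usual EEND-EDA form in which each frame's activity posterior is the sigmoid of the inner product between the frame embedding and a speaker attractor. The function name `attractor_activities`, both threshold values, and the toy vectors are illustrative assumptions. In a real system the embeddings and attractors come from the trained encoder and decoder.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attractor_activities(embeddings, attractors, existence_probs,
                         exist_threshold=0.5, activity_threshold=0.5):
    """Score frames against the attractors whose existence probability passes the threshold.

    embeddings: T x D frame encodings; attractors: candidate D-dim vectors.
    Returns (posteriors, decisions), each a list of T rows with one entry per kept attractor.
    """
    kept = [a for a, p in zip(attractors, existence_probs) if p > exist_threshold]
    posteriors = [
        [sigmoid(sum(e_d * a_d for e_d, a_d in zip(e, a))) for a in kept]
        for e in embeddings
    ]
    decisions = [[1 if p > activity_threshold else 0 for p in row] for row in posteriors]
    return posteriors, decisions


# Toy example (hypothetical values): three candidate attractors, the third is rejected.
embeddings = [[2.0, -2.0], [2.0, 2.0], [-2.0, 2.0], [-2.0, -2.0]]
attractors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
existence_probs = [0.98, 0.91, 0.07]
posteriors, decisions = attractor_activities(embeddings, attractors, existence_probs)
print(decisions)
```

The number of output tracks here is decided at inference time by the existence threshold rather than by the network's output dimension, which is the property that distinguishes attractor-based models from fixed-output EEND.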

Iterative inference extends the range of supported speaker counts by iteratively applying the decoder to unassigned signal regions, thus decoding beyond the maximum number of speakers encountered during training (Horiguchi et al., 2021).

Perceiver-based attractors (DiaPer): Replacing the LSTM decoder with a non-autoregressive Perceiver block stack enables parallel and more stable estimation of attractor vectors. This design maintains scalability and performance with larger numbers of speakers encountered in meetings or broadcast data (Landini, 2024).

Speaker-wise Chain Rule Models: SC-EEND applies the probabilistic chain rule, decomposing the joint speaker activity posterior into a conditional chain and employing an encoder–decoder framework for sequential estimation. This enables flexible, dynamically determined output speaker count and achieves state-of-the-art DER on variable-speaker datasets (Fujita et al., 2020, Takashima et al., 2021).
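The chain-rule decomposition can be written out explicitly. The notation below is an assumption chosen to match the framewise labels used earlier: $y_{1:T,s}$ denotes the full activity sequence of speaker $s$ over all $T$ frames.

$$P(y_{1:T,1}, \dots, y_{1:T,S} \mid X) = \prod_{s=1}^{S} P(y_{1:T,s} \mid y_{1:T,1}, \dots, y_{1:T,s-1}, X)$$

Each factor is estimated in turn, conditioned on the activities already decoded. Decoding can then stop once a newly estimated speaker shows no activity, which is how the output speaker count becomes dynamic.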

3. Hybrid Integration with Clustering-Based Methods

To address the challenge of diarization in long recordings with a potentially large and unknown number of speakers, EEND models have been hybridized with blockwise clustering approaches (Kinoshita et al., 2021). A prominent example is EEND-vector-clustering, in which:

The recording is split into blocks of a fixed number of frames.
The shared neural encoder produces both framewise activity probabilities and low-dimensional speaker embeddings per block.
Silent or inactive speaker tracks are excluded.
Constrained agglomerative hierarchical clustering (AHC) or spectral clustering, with cannot-link constraints (from block-wise exclusivity), is applied to group block-wise embeddings into global speaker clusters.
Outputs are stitched to recover global speaker trajectories.
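The clustering step in the list above can be sketched as a greedy, centroid-linkage agglomerative procedure with cannot-link constraints. This is a simplified illustration, not the algorithm of Kinoshita et al. (2021). The function name `constrained_ahc`, the use of Euclidean distance between centroids, and the toy embeddings are all assumptions. The only property it aims to show is that two embeddings from the same block are never merged, since they belong to different speakers by construction.

```python
import math


def constrained_ahc(embeddings, block_ids, num_speakers):
    """Greedy agglomerative clustering that never merges embeddings from the same block.

    embeddings: list of D-dim vectors, one per active speaker track per block.
    block_ids: block index of each embedding (defines the cannot-link constraints).
    Returns one global cluster label per embedding.
    """
    clusters = [{"members": [i], "blocks": {b}} for i, b in enumerate(block_ids)]
    dim = len(embeddings[0])

    def centroid(cluster):
        n = len(cluster["members"])
        return [sum(embeddings[i][d] for i in cluster["members"]) / n for d in range(dim)]

    while len(clusters) > num_speakers:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if clusters[i]["blocks"] & clusters[j]["blocks"]:
                    continue  # cannot-link: both clusters contain a track from the same block
                dist = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        if best is None:
            break  # no admissible merge remains
        _, i, j = best
        clusters[i]["members"] += clusters[j]["members"]
        clusters[i]["blocks"] |= clusters[j]["blocks"]
        del clusters[j]

    labels = [None] * len(embeddings)
    for k, cluster in enumerate(clusters):
        for i in cluster["members"]:
            labels[i] = k
    return labels


# Toy example (hypothetical values): three blocks, two active tracks in each.
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9], [0.2, 1.0], [1.0, 0.2]]
block_ids = [0, 0, 1, 1, 2, 2]
labels = constrained_ahc(embeddings, block_ids, num_speakers=2)
print(labels)
```

Once each block-local track has a global label, stitching amounts to copying that track's framewise activities into the output row of its cluster.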

This approach combines the overlap robustness of EEND with the speaker-count and sequence scalability of clustering, yielding significant improvements on real datasets such as CALLHOME and outperforming both x-vector clustering and EDA-EEND (e.g., 12.22% DER vs. 15.43% for EDA-EEND on CALLHOME with oracle speaker count, for one setting of the block length) (Kinoshita et al., 2021).

4. Auxiliary Information and Training Enhancements

Recent EEND work incorporates ancillary cues and training strategies to further boost accuracy and robustness:

ASR-Aware EEND: Incorporation of linguistic features from ASR, such as phone, word position, and speaker-change boundaries, via feature concatenation, contextualized self-attention, or multi-task learning. Multi-task learning with ASR-derived "position-in-word" features yields a 20% relative DER reduction (Khare et al., 2022).
Pretraining with Multi-Speaker Identification: Rather than pretraining EEND directly on massive simulated conversations, a multi-speaker SID pretraining scheme uses fully overlapped mixtures generated on-the-fly from large speaker-recognition corpora, providing compact and accurate initializations for lightweight diarization models (e.g., reducing macro average DER to 11.39% across six domains) (Horiguchi et al., 2025).
Learned Summary/Tokens: Providing EEND-EDA's attractor decoder with a learned conversational summary vector, instead of zero-vector initialization, improves attractor separation and framewise assignment, resulting in ∼2% absolute DER reduction, especially notable for recordings with four or more speakers (Broughton et al., 2023).
Speaker Embedding Integration: Fusion of frame-synchronous speaker embeddings (e.g., ECAPA-TDNN) into the acoustic encoder, via concatenation, enhances cluster separability and achieves ∼10.8% relative DER reduction in two-speaker scenarios when coupled with robust VAD (Alvarez-Trejos et al., 2024).

5. Practical Considerations and Empirical Results

Data Simulation and Domain Adaptation: EEND models are data-hungry. Synthetic data following "simulated conversation" recipes—matching real turn-taking, overlap, and speech-activity statistics—produce models that generalize significantly better than those trained with naive mixture simulation (Liu et al., 2021, Landini, 2024). Fine-tuning on small held-out adaptation sets from the target domain further reduces domain mismatch, with benefits more pronounced for Conformer-based backbones (Liu et al., 2021).

Streaming and Online Diarization: Blockwise EEND (BW-EDA-EEND) enables streaming inference. By limiting context to previous blocks, this method achieves linear time complexity while closely matching offline diarization accuracy for up to two speakers, and remaining competitive for more speakers with moderate degradation (Han et al., 2020).

Word-Level and Joint ASR-Diarization Models: Speaker-attributed ASR models, such as Transcribe-to-Diarize, estimate both transcriptions and speaker activity jointly. Using internal attention mechanisms, these models predict word boundaries and speaker labels, handling unlimited numbers of speakers as long as corresponding profiles are provided. These models have set new benchmarks for DER and concatenated minimum-permutation word error rate (cpWER) in joint tasks, especially when the number of speakers is unknown (Kanda et al., 2021, Huang et al., 2023).

Limitations and Prospective Enhancements: While EEND advances have closed much of the gap with modular pipelines—especially for few-speaker, high-overlap conditions—performance often degrades with more speakers, shorter processing blocks, or suboptimal VAD and speaker-count estimation. Ongoing research focuses on embedding enhancement (e.g., via orthogonality or sparsity constraints), joint end-to-end learning with clustering, dynamic adaptation to the number of speakers, and domain-bridging simulation and pretraining (Mun et al., 2023, Kinoshita et al., 2021, Chen et al., 2023, Lee et al., 2024).

6. Comparative Analysis and Benchmark Results

EEND and its variants consistently outperform modular clustering-based architectures (e.g., x-vector + AHC or VBx) in scenarios with significant overlap or variable speaker count, as evidenced by the following:

On CALLHOME multi-speaker telephone conversations, EEND-EDA achieves DERs of 7.9% (two speakers) and 15.3% (up to six speakers), compared to 10.8% and 26.2%, respectively, for the VBx modular pipeline (Landini, 2024).
The DiaPer model matches or surpasses EEND-EDA for both two-speaker and higher-speaker-count scenarios, partly due to its non-autoregressive Perceiver attractor mechanism (Landini, 2024).
EEND-vector-clustering yields 12.22% DER on CALLHOME (oracle speaker count), besting both prior EEND and clustering-based baselines (Kinoshita et al., 2021).
Attention-enhanced encoder–decoder models (AED-EEND) and variants with embedding enhancers attain state-of-the-art DERs on multiple datasets without external VAD: 10.08% (CALLHOME), 24.64% (DIHARD II), 13.00% (AMI) (Chen et al., 2023).
On LibriMix (Libri2Mix), EEND-DEMUX achieves 3.79% DER, outperforming EEND-EDA by 24.5% relative (Mun et al., 2023).

A summary table of representative results:

Method	Dataset	Speakers	DER (%)	Notes
EEND-EDA	CALLHOME	2	7.9	After fine-tuning, no oracle SAD
EEND-EDA	CALLHOME	up to 6	15.3
VBx AHC	CALLHOME	2	10.8	Modular pipeline
DiaPer	CALLHOME	up to 6	13.6	Non-autoregressive attractors
EEND-VC	CALLHOME	up to 6	12.22	Constrained AHC with local blocks
EEND-DEMUX	Libri2Mix	2	3.79	Demultiplexed speaker embeddings
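For readers comparing the figures above, the following sketch shows one common way a frame-level DER and a relative reduction are computed. It assumes equal-size binary reference and hypothesis matrices, no forgiveness collar, and an exhaustive search for the best speaker mapping. Published numbers typically come from standard scoring tools with their own collar and overlap settings, so this sketch will not reproduce them exactly. The function names and toy inputs are hypothetical.

```python
import itertools


def frame_level_der(reference, hypothesis):
    """DER over T x S binary matrices: (miss + false alarm + confusion) / reference speech."""
    num_tracks = len(reference[0])
    total_ref = sum(sum(row) for row in reference)
    if total_ref == 0:
        raise ValueError("reference contains no speech")
    best_err = None
    for perm in itertools.permutations(range(num_tracks)):
        err = 0
        for ref_row, hyp_row in zip(reference, hypothesis):
            n_ref, n_sys = sum(ref_row), sum(hyp_row)
            # Reference speaker s counts as correct if the hypothesis track mapped to it is active.
            n_correct = sum(1 for s in range(num_tracks)
                            if ref_row[s] == 1 and hyp_row[perm[s]] == 1)
            err += max(n_ref, n_sys) - n_correct
        if best_err is None or err < best_err:
            best_err = err
    return best_err / total_ref


def relative_reduction(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


# Toy example (hypothetical values): swapped tracks, one miss in overlap, one false alarm.
reference = [[1, 0], [1, 1], [0, 1], [0, 0]]
hypothesis = [[0, 1], [0, 1], [1, 0], [1, 0]]
print(frame_level_der(reference, hypothesis))

# Relative reduction implied by the two CALLHOME figures quoted in Section 3.
print(round(relative_reduction(15.43, 12.22), 3))
```

Because overlapped frames contribute one unit of reference speech per active speaker, a system that outputs only one speaker per frame accumulates missed speech in every overlap region. This is the error component that EEND-style models are designed to reduce.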

7. Methodological Strengths, Limitations, and Future Directions

Strengths:

EEND architectures are inherently overlap-aware and permit joint VAD, overlap detection, and diarization.
Flexible extensions (EDA, Perceiver, chain-rule) enable handling of unknown and variable speaker-count scenarios.
Conformer and self-attention encoders improve both accuracy and interpretability, as seen in attention-weight analyses (Fujita et al., 2020, Liu et al., 2021).
Joint integration with clustering, supervised pretraining, or ASR-derived cues further enhances robustness to real-domain conditions.

Limitations:

Large-scale annotated data or accurate simulated data matching real statistics are required for state-of-the-art performance.
Embedding and attractor quality, speaker-count estimation, and the sensitivity to domain shifts remain open challenges.
Model scaling to very large numbers of speakers or to extremely long recordings often requires hybrid clustering or block-based decoding schemes (Kinoshita et al., 2021, Yang et al., 2022).
Most EEND architectures presume clean, frame-aligned training data; robustness to highly noisy or low-resource domains is less explored.

Research directions:

End-to-end integration of clustering constraints and blockwise label consistency into the training objective (Yang et al., 2022).
Improved simulation techniques for multi-party, far-field, and real conversational conditions (Landini, 2024, Liu et al., 2021).
Dynamic attractor count and cross-modal diarization (e.g., leveraging video or text).
Integration with word-level diarization and fully joint ASR-diarization streams (Kanda et al., 2021, Huang et al., 2023).

In summary, end-to-end neural speaker diarization constitutes a paradigm shift from modular pipelines to unified, flexible, and overlap-robust neural estimation. Recent advances have coalesced around encoder–decoder attractor mechanisms, attention-based architectures, and data-centric adaptation, yielding competitive or superior performance across a spectrum of conversation and meeting datasets. Ongoing research points to deeper integration with clustering, advanced simulation and pretraining, and fully joint speech and language processing engines.

Key References: (Fujita et al., 2019, Fujita et al., 2020, Horiguchi et al., 2021, Kinoshita et al., 2021, Landini, 2024, Liu et al., 2021, Broughton et al., 2023, Alvarez-Trejos et al., 2024, Khare et al., 2022, Horiguchi et al., 2025, Kanda et al., 2021, Yang et al., 2022, Mun et al., 2023, Han et al., 2020, Takashima et al., 2021, Chen et al., 2023, Huang et al., 2023, Lee et al., 2024)