Sortformer: Diarization & Multispeaker ASR
- Sortformer is a transformer-based model that employs a deterministic sorting objective (Sort Loss) to resolve speaker permutation issues.
- It integrates speaker activity posteriors with multispeaker ASR using sinusoidal kernel embeddings for robust token alignment and improved transcription.
- The model achieves strong performance on benchmarks and supports low-latency streaming via an Arrival-Order Speaker Cache for continuous speaker tracking.
Sortformer is a transformer-based neural model designed to resolve the speaker diarization permutation problem and to integrate diarization into multispeaker automatic speech recognition (ASR) systems. It introduces a deterministic sorting objective, termed Sort Loss, to enforce consistent speaker label ordering and streamline downstream token alignment, representation, and multispeaker transcript generation. The architecture and objectives of Sortformer enable robust, scalable speaker supervision and improved diarization performance across challenging conversational environments.
1. Architectural Principles and Sorting Objective
Sortformer employs an encoder-only transformer architecture, typically built on Fast Conformer and NEST self-supervised learning backbones. Unlike permutation-invariant diarization models (e.g., EEND-GLA, AED-EEND), Sortformer produces multilabel frame-wise predictions for up to $N$ speakers (four in the released checkpoints), with speaker output streams sorted by arrival time: the sequence in which speakers first become active in the audio.
Relative positional embeddings, specifically those from Shaw et al. (2018), are used in transformer layers to permit active sorting by arrival time, breaking permutation equivariance and enabling the transformation of time-local cues into deterministic output assignments. This capability is essential for consistent correspondence between model outputs and reference annotations during both training and inference.
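To make arrival-time ordering concrete, the following sketch (hypothetical helper name, NumPy used purely for illustration) derives the arrival order from a frame-wise binary activity matrix and permutes the speaker streams accordingly:

```python
import numpy as np

def arrival_order(y: np.ndarray) -> np.ndarray:
    """Return speaker indices sorted by first-active frame.

    y: binary activity matrix of shape (T, N) -- frames x speakers.
    Speakers that never become active sort last.
    """
    T, _ = y.shape
    first_active = np.where(y.any(axis=0), y.argmax(axis=0), T)
    return np.argsort(first_active, kind="stable")

# Speaker 1 speaks first (frame 0), speaker 0 second (frame 2).
y = np.array([[0, 1, 0],
              [0, 1, 0],
              [1, 0, 0],
              [1, 0, 1]])
order = arrival_order(y)   # -> [1, 0, 2]
y_sorted = y[:, order]     # speaker streams now in arrival order
```

This deterministic reordering is what lets training targets be matched to output streams without searching over permutations.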
2. The Permutation Problem and Sort Loss
The permutation problem describes the ambiguity in diarization output tracks, where model streams do not inherently correspond to specific reference speakers. Prior systems resolve this using Permutation Invariant Loss (PIL): for each sample, all permutations of the output are considered, and the lowest-error assignment is backpropagated. This introduces computational complexity and complicates integration with token-level objectives in ASR.
Sort Loss is formulated to resolve this ambiguity directly by sorting targets and predictions by speaker arrival time. For binary activity targets $\mathbf{Y} \in \{0,1\}^{T \times N}$ and predicted posteriors $\hat{\mathbf{P}} \in (0,1)^{T \times N}$, the speaker streams are reordered such that the $k$-th stream corresponds to the $k$-th arriving speaker. The loss is then computed as

$$\mathcal{L}_{\text{Sort}} = \mathcal{L}_{\text{BCE}}\big(\eta(\mathbf{Y}),\, \hat{\mathbf{P}}\big),$$

where $\eta(\cdot)$ returns the arrival-ordered permutation of the target streams and $\mathcal{L}_{\text{BCE}}$ is the binary cross-entropy loss.
Sort Loss reduces complexity from the $O(N!)$ permutation search of PIL to a single $O(N \log N)$ sort, and facilitates joint training with ASR token objectives. Empirical results indicate only marginal DER increases versus PIL and improved consistency in downstream integration.
A hybrid loss (Sort + PIL) can be used:

$$\mathcal{L}_{\text{hybrid}} = \alpha\,\mathcal{L}_{\text{Sort}} + (1 - \alpha)\,\mathcal{L}_{\text{PIL}},$$

where $\alpha \in [0, 1]$ is tuned for dataset robustness.
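A minimal NumPy sketch of the three losses above (helper names are illustrative, and PIL is implemented as a brute-force search over column permutations, which is only practical for small $N$):

```python
import numpy as np
from itertools import permutations

def bce(y, p, eps=1e-7):
    """Mean binary cross-entropy between targets y and posteriors p."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def arrival_sort(y):
    """Permute speaker columns of y (T x N) into arrival order."""
    T, _ = y.shape
    first = np.where(y.any(axis=0), y.argmax(axis=0), T)
    return y[:, np.argsort(first, kind="stable")]

def sort_loss(y, p):
    """BCE against arrival-ordered targets: a single O(N log N) sort."""
    return bce(arrival_sort(y), p)

def pil_loss(y, p):
    """Permutation-invariant loss: best over all N! column orders."""
    n = y.shape[1]
    return min(bce(y[:, list(perm)], p) for perm in permutations(range(n)))

def hybrid_loss(y, p, alpha=0.5):
    """L_hybrid = alpha * L_Sort + (1 - alpha) * L_PIL."""
    return alpha * sort_loss(y, p) + (1 - alpha) * pil_loss(y, p)
```

By construction, PIL is a lower bound on Sort Loss for any sample, since it takes the minimum over all permutations, including the arrival-ordered one.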
3. Integration with Multispeaker ASR
Sortformer enables direct use of speaker activity posteriors in ASR via sinusoidal kernel-based embeddings. For $N$ speakers and encoder dimension $D$, each speaker $k$ is assigned a fixed sinusoidal kernel over encoder channels, of the form

$$\mathbf{K}_{k,d} = \sin\!\left(\frac{\pi\,(k+1)(d+1)}{D}\right)$$

for speaker $k \in \{0,\dots,N-1\}$ and channel $d \in \{0,\dots,D-1\}$, so that each speaker receives a distinct frequency. These kernels are assembled into a matrix $\mathbf{K} \in \mathbb{R}^{N \times D}$ and used to encode the ASR state with speaker activity:

$$\mathbf{E}' = \mathbf{E} + \hat{\mathbf{P}}\,\mathbf{K},$$

where $\mathbf{E} \in \mathbb{R}^{T \times D}$ is the original encoder activation and $\hat{\mathbf{P}} \in \mathbb{R}^{T \times N}$ the diarization output. This soft supervision enables plug-and-play integration, flexible fine-tuning, and preservation of speaker identity across overlapping speech or rapid turn-taking scenarios.
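A dimension-consistent sketch of the kernel injection follows; the exact frequency schedule here is illustrative rather than the paper's verbatim definition, and all helper names are hypothetical:

```python
import numpy as np

def speaker_kernels(n_spk: int, d_model: int) -> np.ndarray:
    """Build an (N x D) matrix of fixed sinusoidal speaker kernels.

    Each speaker k gets a distinct sinusoid across encoder channels d;
    the frequency schedule below is an illustrative choice.
    """
    k = np.arange(n_spk)[:, None]    # (N, 1) speaker indices
    d = np.arange(d_model)[None, :]  # (1, D) channel indices
    return np.sin(np.pi * (k + 1) * (d + 1) / d_model)

def inject_speaker_activity(E, P, K):
    """E' = E + P @ K: add activity-weighted speaker kernels to encoder states.

    E: (T, D) encoder activations, P: (T, N) diarization posteriors,
    K: (N, D) speaker kernels.
    """
    return E + P @ K

T, N, D = 50, 4, 8
E = np.random.randn(T, D)
P = np.random.rand(T, N)  # frame-wise speaker posteriors
Ep = inject_speaker_activity(E, P, speaker_kernels(N, D))
```

Note the "soft" property: frames where all posteriors are zero (silence) leave the encoder state untouched, which is what makes the supervision non-destructive for single-speaker audio.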
In multispeaker ASR, special tokens (e.g., `<|spk0|>`) are injected at segment or word boundaries in accordance with speaker change points, maintaining the sorted correspondence between diarization and transcript serialization.
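A minimal sketch of speaker-token serialization (the helper is hypothetical; only the `<|spkK|>` token convention comes from the text above):

```python
def serialize(words):
    """Insert <|spkK|> tokens at speaker change points.

    words: list of (speaker_index, word) pairs, already in
    arrival-sorted, time-ordered serialization.
    """
    out, prev = [], None
    for spk, word in words:
        if spk != prev:            # speaker change point
            out.append(f"<|spk{spk}|>")
            prev = spk
        out.append(word)
    return " ".join(out)

words = [(0, "hello"), (0, "there"), (1, "hi"), (0, "yes")]
# -> "<|spk0|> hello there <|spk1|> hi <|spk0|> yes"
```

Because the speaker indices are already arrival-ordered, the serialized transcript aligns deterministically with the diarization output streams.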
4. Streaming Sortformer and Speaker Cache Mechanism
Streaming Sortformer extends Sortformer to online, low-latency applications by utilizing an Arrival-Order Speaker Cache (AOSC). The AOSC orders per-frame acoustic embeddings for each speaker by arrival time, dynamically updating cache contents based on model probabilities and recency. At each inference step, embeddings from the cache, frames in a FIFO context queue, and the current input chunk are concatenated for prediction. The cache update assigns each candidate frame $t$ a score $s_{k,t}$ for speaker $k$, derived from the model's speaker activity posteriors.
Frames are retained or pruned based on $s_{k,t}$, with silence embeddings appended and recency/speaker coverage enforced. This mechanism enables consistent speaker assignment without attractors or explicit permutation alignment across streaming segments.
FIFO buffering compensates for context loss in short chunks, and cache update handles speaker continuity when audio spans chunk boundaries.
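The cache mechanism can be sketched as follows; this is a simplified illustration (the class and its scoring rule, which here is just the raw speaker posterior, are assumptions, not the paper's exact update):

```python
import numpy as np

class ArrivalOrderSpeakerCache:
    """Simplified AOSC sketch: retain the highest-scoring embeddings per
    speaker, in arrival order. Scoring by raw speaker posterior is a
    simplification of the full update rule."""

    def __init__(self, n_spk: int, per_spk: int):
        self.per_spk = per_spk
        self.cache = [[] for _ in range(n_spk)]  # per-speaker (score, emb) lists

    def update(self, emb: np.ndarray, post: np.ndarray):
        """emb: (T, D) chunk embeddings, post: (T, N) speaker posteriors."""
        for k, slot in enumerate(self.cache):
            for t in range(emb.shape[0]):
                slot.append((float(post[t, k]), emb[t]))
            # Prune each speaker's slot to its top-scoring per_spk frames.
            slot.sort(key=lambda sv: sv[0], reverse=True)
            del slot[self.per_spk:]

    def context(self) -> np.ndarray:
        """Concatenate cached embeddings (arrival order) for the next step."""
        frames = [v for slot in self.cache for _, v in slot]
        return np.stack(frames) if frames else np.empty((0, 0))

cache = ArrivalOrderSpeakerCache(n_spk=2, per_spk=2)
cache.update(np.random.randn(5, 16), np.random.rand(5, 2))
ctx = cache.context()  # cached context prepended to the next chunk
```

In the full system this cached context is concatenated with the FIFO queue and the current chunk before each forward pass, so speaker identities persist across chunk boundaries.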
5. Experimental Results and Comparative Analysis
Sortformer has been evaluated across standard benchmark datasets (DIHARD III, CALLHOME, CH109) and realistic synthetic corpora (LibriConvo) (Gedeon et al., 27 Oct 2025). In LibriConvo, Sortformer (diar_sortformer_4spk-v1) was evaluated without dataset-specific fine-tuning and achieved the following DER:
| Model | Validation DER (%) | Test DER (%) |
|---|---|---|
| Pyannote | 25.6 | 24.4 |
| Sortformer | 12.9 | 11.1 |
Sortformer roughly halves DER versus the pyannote pipeline (and more than halves it on the test set), with a narrower DER distribution across recordings, indicating robustness in overlapping, rapid turn-taking dialogues. In broader benchmarks (Park et al., 10 Sep 2024), Sortformer using hybrid loss matches or outperforms EEND-GLA-Large and EEND-EDA on both 2-speaker and multi-speaker subsets. In multispeaker ASR, Sortformer supervision yields substantial improvements in both WER and cpWER, and fine-tuning the diarization module further boosts ASR accuracy.
Streaming Sortformer maintains competitive DER at low latencies (as low as 0.32s chunk) and does not degrade substantially versus offline versions, confirming its robustness for real-time applications (Medennikov et al., 24 Jul 2025).
6. Implementation and Applicability
Implementation is performed using the NVIDIA NeMo framework, with code and checkpoints available (Park et al., 10 Sep 2024). Base encoders are pre-trained on multilingual Mel-spectrogram features; model sizes typically exceed 100M parameters. Training for Sortformer involves large-scale mixtures of real and synthetic data (e.g., 5150h simulated, 2030h real speech), with arrival-time ordering enforced in both offline and streaming tasks. Streaming Sortformer additionally employs cache permutation augmentation to boost robustness.
Sortformer offers plug-and-play speaker supervision for ASR via sinusoidal kernels, straightforward token alignment through speaker-ordered serialization, and compatibility with adapter-based fine-tuning for efficient task specialization.
7. Impact and Future Directions
Sortformer advances the state-of-the-art for speaker diarization and multispeaker ASR by resolving the permutation problem through deterministic arrival-time ordering, improving training efficiency, and simplifying integration into downstream systems. Its robustness to overlapping speech and conversational dynamics is validated in diverse datasets, and streaming extensions enable low-latency, real-time speaker tracking without the need for complex attractors or explicit permutation mechanisms. Sortformer establishes a high baseline for diarization, with substantial improvements in DER and ASR metrics, and its architecture is extensible to joint supervision and large-scale, realistic deployment.
Sortformer's plug-and-play design and released implementation facilitate scalable research and practical application in multi-speaker audio analytics, conversational AI, and speech transcription pipelines.