Sortformer: Diarization & Multispeaker ASR
- Sortformer is a transformer-based model that employs a deterministic sorting objective (Sort Loss) to resolve speaker permutation issues.
- It integrates speaker activity posteriors with multispeaker ASR using sinusoidal kernel embeddings for robust token alignment and improved transcription.
- The model achieves strong performance on benchmarks and supports low-latency streaming via an Arrival-Order Speaker Cache for continuous speaker tracking.
Sortformer is a transformer-based neural model designed to resolve the speaker diarization permutation problem and to integrate diarization into multispeaker automatic speech recognition (ASR) systems. It introduces a deterministic sorting objective, termed Sort Loss, to enforce consistent speaker label ordering and streamline downstream token alignment, representation, and multispeaker transcript generation. The architecture and objectives of Sortformer enable robust, scalable speaker supervision and improved diarization performance across challenging conversational environments.
1. Architectural Principles and Sorting Objective
Sortformer employs an encoder-only transformer architecture, typically built on Fast Conformer and NEST self-supervised learning backbones. Unlike permutation-invariant diarization models (e.g., EEND-GLA, AED-EEND), Sortformer produces multilabel frame-wise predictions for up to $N$ speakers (four in the released checkpoints), with speaker output streams sorted by arrival time: the sequence in which speakers first become active in the audio.
Relative positional embeddings, specifically those from Shaw et al. (2018), are used in transformer layers to permit active sorting by arrival time, breaking permutation equivariance and enabling the transformation of time-local cues into deterministic output assignments. This capability is essential for consistent correspondence between model outputs and reference annotations during both training and inference.
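To make arrival-time ordering concrete, the following sketch (hypothetical helper name, NumPy used purely for illustration) derives the arrival order from a frame-wise binary activity matrix and permutes the speaker streams accordingly:

```python
import numpy as np

def arrival_order(y: np.ndarray) -> np.ndarray:
    """Return speaker indices sorted by first-active frame.

    y: binary activity matrix of shape (T, N) -- frames x speakers.
    Speakers that never become active sort last.
    """
    T, _ = y.shape
    first_active = np.where(y.any(axis=0), y.argmax(axis=0), T)
    return np.argsort(first_active, kind="stable")

# Speaker 1 speaks first (frame 0), speaker 0 second (frame 2).
y = np.array([[0, 1, 0],
              [0, 1, 0],
              [1, 0, 0],
              [1, 0, 1]])
order = arrival_order(y)   # -> [1, 0, 2]
y_sorted = y[:, order]     # speaker streams now in arrival order
```

This deterministic reordering is what lets training targets be matched to output streams without searching over permutations.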
2. The Permutation Problem and Sort Loss
The permutation problem describes the ambiguity in diarization output tracks, where model streams do not inherently correspond to specific reference speakers. Prior systems resolve this using Permutation Invariant Loss (PIL): for each sample, all permutations of the output are considered, and the lowest-error assignment is backpropagated. This introduces computational complexity and complicates integration with token-level objectives in ASR.
Sort Loss is formulated to resolve this ambiguity directly by sorting targets and predictions by speaker arrival time. For binary activity targets $\mathbf{Y} \in \{0,1\}^{T \times N}$ and predicted posteriors $\hat{\mathbf{P}} \in (0,1)^{T \times N}$, the speaker streams are reordered such that the $k$-th stream corresponds to the $k$-th arriving speaker. The loss is then computed as

$$\mathcal{L}_{\text{Sort}} = \mathcal{L}_{\text{BCE}}\big(\eta(\mathbf{Y}),\, \hat{\mathbf{P}}\big),$$

where $\eta(\cdot)$ returns the arrival-ordered permutation of the target streams and $\mathcal{L}_{\text{BCE}}$ is the binary cross-entropy loss.
Sort Loss reduces complexity from the $O(N!)$ permutation search of PIL to a single $O(N \log N)$ sort, and facilitates joint training with ASR token objectives. Empirical results indicate only marginal DER increases versus PIL and improved consistency in downstream integration.
A hybrid loss (Sort + PIL) can be used:

$$\mathcal{L}_{\text{hybrid}} = \alpha\,\mathcal{L}_{\text{Sort}} + (1 - \alpha)\,\mathcal{L}_{\text{PIL}},$$

where $\alpha \in [0, 1]$ is tuned for dataset robustness.
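A minimal NumPy sketch of the three losses above (helper names are illustrative, and PIL is implemented as a brute-force search over column permutations, which is only practical for small $N$):

```python
import numpy as np
from itertools import permutations

def bce(y, p, eps=1e-7):
    """Mean binary cross-entropy between targets y and posteriors p."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def arrival_sort(y):
    """Permute speaker columns of y (T x N) into arrival order."""
    T, _ = y.shape
    first = np.where(y.any(axis=0), y.argmax(axis=0), T)
    return y[:, np.argsort(first, kind="stable")]

def sort_loss(y, p):
    """BCE against arrival-ordered targets: a single O(N log N) sort."""
    return bce(arrival_sort(y), p)

def pil_loss(y, p):
    """Permutation-invariant loss: best over all N! column orders."""
    n = y.shape[1]
    return min(bce(y[:, list(perm)], p) for perm in permutations(range(n)))

def hybrid_loss(y, p, alpha=0.5):
    """L_hybrid = alpha * L_Sort + (1 - alpha) * L_PIL."""
    return alpha * sort_loss(y, p) + (1 - alpha) * pil_loss(y, p)
```

By construction, PIL is a lower bound on Sort Loss for any sample, since it takes the minimum over all permutations, including the arrival-ordered one.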
3. Integration with Multispeaker ASR
Sortformer enables direct use of speaker activity posteriors in ASR via sinusoidal kernel-based embeddings. For $N$ speakers and encoder dimension $D$, each speaker $k$ is assigned a fixed sinusoidal kernel over encoder channels, of the form

$$\mathbf{K}_{k,d} = \sin\!\left(\frac{\pi\,(k+1)(d+1)}{D}\right)$$

for speaker $k \in \{0,\dots,N-1\}$ and channel $d \in \{0,\dots,D-1\}$, so that each speaker receives a distinct frequency. These kernels are assembled into a matrix $\mathbf{K} \in \mathbb{R}^{N \times D}$ and used to encode the ASR state with speaker activity:

$$\mathbf{E}' = \mathbf{E} + \hat{\mathbf{P}}\,\mathbf{K},$$

where $\mathbf{E} \in \mathbb{R}^{T \times D}$ is the original encoder activation and $\hat{\mathbf{P}} \in \mathbb{R}^{T \times N}$ the diarization output. This soft supervision enables plug-and-play integration, flexible fine-tuning, and preservation of speaker identity across overlapping speech or rapid turn-taking scenarios.
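A dimension-consistent sketch of the kernel injection follows; the exact frequency schedule here is illustrative rather than the paper's verbatim definition, and all helper names are hypothetical:

```python
import numpy as np

def speaker_kernels(n_spk: int, d_model: int) -> np.ndarray:
    """Build an (N x D) matrix of fixed sinusoidal speaker kernels.

    Each speaker k gets a distinct sinusoid across encoder channels d;
    the frequency schedule below is an illustrative choice.
    """
    k = np.arange(n_spk)[:, None]    # (N, 1) speaker indices
    d = np.arange(d_model)[None, :]  # (1, D) channel indices
    return np.sin(np.pi * (k + 1) * (d + 1) / d_model)

def inject_speaker_activity(E, P, K):
    """E' = E + P @ K: add activity-weighted speaker kernels to encoder states.

    E: (T, D) encoder activations, P: (T, N) diarization posteriors,
    K: (N, D) speaker kernels.
    """
    return E + P @ K

T, N, D = 50, 4, 8
E = np.random.randn(T, D)
P = np.random.rand(T, N)  # frame-wise speaker posteriors
Ep = inject_speaker_activity(E, P, speaker_kernels(N, D))
```

Note the "soft" property: frames where all posteriors are zero (silence) leave the encoder state untouched, which is what makes the supervision non-destructive for single-speaker audio.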
In multispeaker ASR, special tokens (e.g., `<|spk0|>`) are injected at segment or word boundaries in accordance with speaker change points, maintaining the sorted correspondence between diarization and transcript serialization.
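A minimal sketch of speaker-token serialization (the helper is hypothetical; only the `<|spkK|>` token convention comes from the text above):

```python
def serialize(words):
    """Insert <|spkK|> tokens at speaker change points.

    words: list of (speaker_index, word) pairs, already in
    arrival-sorted, time-ordered serialization.
    """
    out, prev = [], None
    for spk, word in words:
        if spk != prev:            # speaker change point
            out.append(f"<|spk{spk}|>")
            prev = spk
        out.append(word)
    return " ".join(out)

words = [(0, "hello"), (0, "there"), (1, "hi"), (0, "yes")]
# -> "<|spk0|> hello there <|spk1|> hi <|spk0|> yes"
```

Because the speaker indices are already arrival-ordered, the serialized transcript aligns deterministically with the diarization output streams.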
4. Streaming Sortformer and Speaker Cache Mechanism
Streaming Sortformer extends Sortformer to online, low-latency applications by utilizing an Arrival-Order Speaker Cache (AOSC). The AOSC orders per-frame acoustic embeddings for each speaker by arrival time, dynamically updating cache contents based on model probabilities and recency. At each inference step, embeddings from the cache, frames in a FIFO context queue, and the current input chunk are concatenated for prediction. The cache update assigns each candidate frame $t$ a score $s_{k,t}$ for speaker $k$, derived from the model's speaker activity posteriors.
Frames are retained or pruned based on $s_{k,t}$, with silence embeddings appended and recency/speaker coverage enforced. This mechanism enables consistent speaker assignment without attractors or explicit permutation alignment across streaming segments.
FIFO buffering compensates for context loss in short chunks, and cache update handles speaker continuity when audio spans chunk boundaries.
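The cache mechanism can be sketched as follows; this is a simplified illustration (the class and its scoring rule, which here is just the raw speaker posterior, are assumptions, not the paper's exact update):

```python
import numpy as np

class ArrivalOrderSpeakerCache:
    """Simplified AOSC sketch: retain the highest-scoring embeddings per
    speaker, in arrival order. Scoring by raw speaker posterior is a
    simplification of the full update rule."""

    def __init__(self, n_spk: int, per_spk: int):
        self.per_spk = per_spk
        self.cache = [[] for _ in range(n_spk)]  # per-speaker (score, emb) lists

    def update(self, emb: np.ndarray, post: np.ndarray):
        """emb: (T, D) chunk embeddings, post: (T, N) speaker posteriors."""
        for k, slot in enumerate(self.cache):
            for t in range(emb.shape[0]):
                slot.append((float(post[t, k]), emb[t]))
            # Prune each speaker's slot to its top-scoring per_spk frames.
            slot.sort(key=lambda sv: sv[0], reverse=True)
            del slot[self.per_spk:]

    def context(self) -> np.ndarray:
        """Concatenate cached embeddings (arrival order) for the next step."""
        frames = [v for slot in self.cache for _, v in slot]
        return np.stack(frames) if frames else np.empty((0, 0))

cache = ArrivalOrderSpeakerCache(n_spk=2, per_spk=2)
cache.update(np.random.randn(5, 16), np.random.rand(5, 2))
ctx = cache.context()  # cached context prepended to the next chunk
```

In the full system this cached context is concatenated with the FIFO queue and the current chunk before each forward pass, so speaker identities persist across chunk boundaries.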
5. Experimental Results and Comparative Analysis
Sortformer has been evaluated across standard benchmark datasets (DIHARD III, CALLHOME, CH109) and realistic synthetic corpora (LibriConvo) (Gedeon et al., 27 Oct 2025). In LibriConvo, Sortformer (diar_sortformer_4spk-v1) was evaluated without dataset-specific fine-tuning and achieved the following DER:
| Model | Validation DER (%) | Test DER (%) |
|---|---|---|
| Pyannote | 25.6 | 24.4 |
| Sortformer | 12.9 | 11.1 |
Sortformer roughly halves DER versus the pyannote pipeline (and more than halves it on the test set), with a narrower DER distribution across recordings, indicating robustness in overlapping, rapid turn-taking dialogues. In broader benchmarks (Park et al., 10 Sep 2024), Sortformer using hybrid loss matches or outperforms EEND-GLA-Large and EEND-EDA on both 2-speaker and multi-speaker subsets. In multispeaker ASR, Sortformer supervision yields substantial improvements in both WER and cpWER, and fine-tuning the diarization module further boosts ASR accuracy.
Streaming Sortformer maintains competitive DER at low latencies (as low as 0.32s chunk) and does not degrade substantially versus offline versions, confirming its robustness for real-time applications (Medennikov et al., 24 Jul 2025).
6. Implementation and Applicability
Implementation is performed using the NVIDIA NeMo framework, with code and checkpoints available (Park et al., 10 Sep 2024). Base encoders are pre-trained on multilingual Mel-spectrogram features; model sizes typically exceed 100M parameters. Training for Sortformer involves large-scale mixtures of real and synthetic data (e.g., 5150h simulated, 2030h real speech), with arrival-time ordering enforced in both offline and streaming tasks. Streaming Sortformer additionally employs cache permutation augmentation to boost robustness.
Sortformer offers plug-and-play speaker supervision for ASR via sinusoidal kernels, straightforward token alignment through speaker-ordered serialization, and compatibility with adapter-based fine-tuning for efficient task specialization.
7. Impact and Future Directions
Sortformer advances the state-of-the-art for speaker diarization and multispeaker ASR by resolving the permutation problem through deterministic arrival-time ordering, improving training efficiency, and simplifying integration into downstream systems. Its robustness to overlapping speech and conversational dynamics is validated in diverse datasets, and streaming extensions enable low-latency, real-time speaker tracking without the need for complex attractors or explicit permutation mechanisms. Sortformer establishes a high baseline for diarization, with substantial improvements in DER and ASR metrics, and its architecture is extensible to joint supervision and large-scale, realistic deployment.
Sortformer's plug-and-play design and released implementation facilitate scalable research and practical application in multi-speaker audio analytics, conversational AI, and speech transcription pipelines.