VoxSRC Challenge: Speaker Recognition Benchmarks

Updated 5 March 2026

The VoxSRC Challenge is an annual benchmark that standardizes data, metrics, and evaluation protocols for speaker recognition and diarisation under real-world conditions.
The challenge drives innovations in deep embedding architectures, margin-based loss functions, and self-supervised learning techniques to enhance system robustness.
Year-over-year progress shows significant reductions in error rates, highlighting improvements in feature extraction, score calibration, and system fusion strategies.

The VoxCeleb Speaker Recognition Challenge (VoxSRC) is an annual, benchmark-driven evaluation of automatic speaker recognition and diarisation systems operating under real-world, unconstrained conditions. Since its inception in 2019, VoxSRC has provided a unified public platform with standardized data, metrics, and evaluation protocols, catalyzing rapid advances in "in the wild" speaker verification and diarisation. The challenge encompasses various tracks, including closed/open training conditions, self-supervised and semi-supervised adaptation, and diarisation, with each track reflecting distinct methodological constraints and research objectives. VoxSRC has served as a principal driver of innovation in deep embedding architectures, loss functions, domain adaptation, and multi-stage training-adaptation paradigms in speaker recognition research (Huh et al., 2024).

1. Challenge Structure and Tracks

VoxSRC is composed of multiple tracks that typically include:

Speaker Verification (Closed condition): Training is restricted to the publicly released VoxCeleb2 development set, and external data is disallowed. This setting isolates pure algorithmic improvements (Nagrani et al., 2020, Huh et al., 2023).
Speaker Verification (Open condition): Training with any public or proprietary data, except the shared blind test set (Chung et al., 2019, Nagrani et al., 2020, Huh et al., 2023).
Self-Supervised and Semi-Supervised Tracks: Introduced in 2020–2022, these tracks either prohibit the use of speaker labels or limit labels to a small target-domain subset; systems must employ self-supervised learning, pseudo-labeling, or domain adaptation (Nagrani et al., 2020, Huh et al., 2023).
Speaker Diarisation (Open): Any public or internal data (excluding the test set) may be used for diarisation, which focuses on segmenting and clustering multi-speaker audio (Nagrani et al., 2020, Brown et al., 2022, Huh et al., 2023).

A new test set is released annually for leaderboard evaluation, but a subset of the VoxSRC2019 test set is consistently re-used to enable longitudinal performance comparisons (Huh et al., 2024).

2. Task Definition and Evaluation Metrics

VoxSRC defines two core tasks: speaker verification (the binary same/different decision based on segment-pairs) and speaker diarisation ("who spoke when" labeling in multi-speaker audio).

Speaker Verification Metrics:

Equal Error Rate (EER): The error rate at the decision threshold where the false acceptance rate and the false rejection rate are equal.
Minimum Detection Cost Function (minDCF):

$\mathrm{minDCF} = \min_\theta \left[ C_{\mathrm{miss}} P_{\mathrm{miss}}(\theta) P_{\mathrm{tar}} + C_{\mathrm{fa}} P_{\mathrm{fa}}(\theta) (1-P_{\mathrm{tar}}) \right]$

with $C_{\mathrm{miss}} = C_{\mathrm{fa}} = 1$ and $P_{\mathrm{tar}} = 0.05$ in most challenge years (Chung et al., 2019, Nagrani et al., 2020, Brown et al., 2022, Huh et al., 2023).
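
As a rough illustration of the two verification metrics, the following minimal sketch sweeps a threshold over a list of trial scores and evaluates the EER and the minDCF expression given above. The function names and the toy trial list are assumptions made for this example and are not taken from the official VoxSRC scoring scripts; the cost is left un-normalized, exactly as in the formula, whereas some toolkits additionally divide by the cost of a trivial accept-all or reject-all system.

```python
def error_rates(scores, labels):
    """Return (p_miss, p_fa) lists swept over every candidate threshold.

    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    A trial is accepted when score >= threshold.
    """
    n_tar = sum(labels)
    n_non = len(labels) - n_tar
    # Candidate thresholds: each distinct score, plus one above the maximum
    # so that the "reject everything" operating point is included.
    thresholds = sorted(set(scores)) + [max(scores) + 1.0]
    p_miss, p_fa = [], []
    for t in thresholds:
        misses = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        false_accepts = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        p_miss.append(misses / n_tar)
        p_fa.append(false_accepts / n_non)
    return p_miss, p_fa


def compute_eer(scores, labels):
    """Approximate the EER at the threshold where |P_miss - P_fa| is smallest."""
    p_miss, p_fa = error_rates(scores, labels)
    i = min(range(len(p_miss)), key=lambda k: abs(p_miss[k] - p_fa[k]))
    return (p_miss[i] + p_fa[i]) / 2.0


def compute_min_dcf(scores, labels, p_tar=0.05, c_miss=1.0, c_fa=1.0):
    """Minimum over thresholds of C_miss*P_miss*P_tar + C_fa*P_fa*(1 - P_tar)."""
    p_miss, p_fa = error_rates(scores, labels)
    return min(c_miss * pm * p_tar + c_fa * pf * (1.0 - p_tar)
               for pm, pf in zip(p_miss, p_fa))


# Toy trial list (hypothetical scores): one non-target scores above two targets.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 0]
print(compute_eer(toy_scores, toy_labels))
print(compute_min_dcf(toy_scores, toy_labels))
```

The exhaustive sweep is quadratic in the number of trials, which is fine for illustration; a practical scorer would sort the scores once and accumulate the error counts in a single pass.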

Speaker Diarisation Metrics:

Diarisation Error Rate (DER): The sum of missed speech, false alarms, and speaker confusions, normalized by reference speech time, with a 0.25 s "forgiveness collar" (Nagrani et al., 2020):

$\mathrm{DER} = \frac{T_\mathrm{miss} + T_\mathrm{fa} + T_\mathrm{spk}}{T_\mathrm{total}} \times 100\%$

Jaccard Error Rate (JER): One minus the average Jaccard index over optimally mapped reference and predicted speakers (Huh et al., 2023):

$\mathrm{JER} = 1 - \frac{1}{S} \sum_{s=1}^S \frac{|R_s\cap H_{\pi(s)}|}{|R_s\cup H_{\pi(s)}|}$
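
The two diarisation formulas can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: DER is computed from already-accumulated error durations, and JER is computed over sets of frame indices with an exhaustive search for the speaker mapping $\pi$. The function names and the frame-set representation are hypothetical; the forgiveness collar is not modelled, and a real scorer would work on timed segments and use an assignment algorithm rather than brute force.

```python
from itertools import permutations


def der_percent(t_miss, t_fa, t_spk, t_total):
    """DER as a percentage of total reference speech time (all in seconds)."""
    return 100.0 * (t_miss + t_fa + t_spk) / t_total


def jer(reference, hypothesis):
    """JER over frame-index sets.

    reference, hypothesis: dicts mapping speaker id -> set of frame indices.
    The mapping is found by exhaustive search, so this only suits a handful
    of speakers.
    """
    ref_ids = list(reference)
    # Pad the hypothesis with empty "speakers" so that every reference
    # speaker can be mapped; an unmatched speaker gets a Jaccard index of 0.
    hyp_sets = list(hypothesis.values())
    hyp_sets += [set()] * max(0, len(ref_ids) - len(hyp_sets))

    def jaccard(r, h):
        union = r | h
        return len(r & h) / len(union) if union else 0.0

    best = 0.0
    for perm in permutations(hyp_sets, len(ref_ids)):
        total = sum(jaccard(reference[s], h) for s, h in zip(ref_ids, perm))
        best = max(best, total)
    return 1.0 - best / len(ref_ids)


# Toy example: 20 frames, two reference speakers, boundary placed 2 frames early.
ref = {"A": set(range(0, 10)), "B": set(range(10, 20))}
hyp = {"x": set(range(0, 8)), "y": set(range(8, 20))}
print(der_percent(t_miss=2.0, t_fa=1.0, t_spk=3.0, t_total=120.0))
print(jer(ref, hyp))
```

Because JER averages over speakers rather than over time, a speaker with little speech weighs as much as a dominant one, which is why the two metrics can rank systems differently.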

3. Benchmark Datasets

VoxSRC provides all data required for training, validation, and blind evaluation each year:

VoxCeleb1/2: The main training resource, extracted from YouTube interviews, covering >5,000 speakers and over one million utterances (Chung et al., 2019, Nagrani et al., 2020, Huh et al., 2023).
VoxMovies: Out-of-domain movie clips were introduced to test generalization beyond interviews starting in 2020 (Nagrani et al., 2020).
VoxConverse: Multi-speaker YouTube data for diarisation, with an average of 4–6 speakers per file and 3–30% overlapped speech (Nagrani et al., 2020, Brown et al., 2022).
CN-Celeb: The domain adaptation tracks (2022–2023) introduced non-English (Mandarin) speech (Huh et al., 2023).
Validation and Test Sets: Trial-pair lists are published for validation, and a hidden test set is used for final scoring. Trial difficulty has increased across years (e.g., "hard positives" from the same speaker under substantial covariate shift, and "hard negatives" from different speakers recorded in the same channel) (Huh et al., 2023).

4. System Architectures and Methodologies

A consensus state-of-the-art (SOTA) pipeline has emerged from challenge winners, generally involving:

Feature Extraction: 64/80/96-dimensional log-mel spectrograms or MFCCs; often with pre-emphasis, mean-variance normalization, and per-utterance normalization.
Embedding Backbone: Deep CNN or TDNN variants:
- ResNet family (ResNet34 and deeper variants, from ResNet50 up to ResNet518), RepVGG, SE-ResNet, Res2Net (Kwon et al., 2020, Xiao et al., 2020, Huh et al., 2024, Huh et al., 2023).
- ECAPA-TDNN and variants emphasizing grouped convolutions, channel attention, and multi-layer feature fusion.
- MQMHA/ASP pooling to aggregate frame-level representations (Zheng et al., 2023).
Loss Functions: Margin-based softmax objectives dominate:
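- AM-Softmax (CosFace): subtracts an additive margin from the target-class cosine, so a speaker is classified correctly only if its cosine score beats the others by that margin.
- AAM-Softmax (ArcFace): adds an additive margin to the target-class angle, which tightens intra-class compactness and widens inter-class separation.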
- Composite losses: Mix of softmax/cross-entropy and metric learning (e.g., Angular Prototypical Loss, Inter-TopK penalties, Sub-center loss) (Kwon et al., 2020, Chen et al., 2022).
Training Recipes:
- Extensive data augmentation: MUSAN noise/music/babble, RIR convolution, speed/pitch perturbation, sometimes SpecAugment (Chen et al., 2022).
- Large batch sizes (200 or more), mixed-precision training for scaling, and cyclical learning rates or ReduceLROnPlateau scheduling.
- Two- or three-stage curriculum: short-segment initial training, long-segment fine-tuning with larger margin (Zheng et al., 2023, Chen et al., 2022).
Scoring and Calibration:
- Cosine similarity and/or PLDA back-ends; PLDA is now often superseded by cosine scoring on length-normalized embeddings combined with AS-norm (adaptive symmetric score normalization) over large impostor cohorts.
- Quality-aware post-processing: QMF calibration, consistency factor (CMF) (Zheng et al., 2023, Huh et al., 2023).
- Multi-system fusion via logistic regression fusers, with late normalization and calibration (Xiang, 2020).
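
Two of the components listed above lend themselves to a compact illustration: the margin applied to the target-class logit during training, and AS-norm applied to a raw trial score at test time. The sketch below is a minimal version of each, not the recipe of any particular challenge entry. The function names and the default values (`margin=0.2`, `scale=30.0`, `top_n=300`) are assumptions chosen for the example, and production implementations add safeguards (for instance when the target angle plus the margin exceeds $\pi$, or when a cohort has zero variance) that are omitted here.

```python
import math
from statistics import mean, pstdev


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def margin_logits(embedding, class_weights, target, margin=0.2, scale=30.0, angular=True):
    """Scaled logits with a margin applied only to the target class.

    angular=True  -> AAM-Softmax style: scale * cos(theta_target + margin)
    angular=False -> AM-Softmax style:  scale * (cos(theta_target) - margin)
    """
    logits = []
    for k, w in enumerate(class_weights):
        c = max(-1.0, min(1.0, cosine(embedding, w)))  # clamp for acos
        if k == target:
            c = math.cos(math.acos(c) + margin) if angular else c - margin
        logits.append(scale * c)
    return logits


def as_norm(raw_score, enroll_cohort_scores, test_cohort_scores, top_n=300):
    """Adaptive symmetric normalization of one trial score.

    Each side is normalized with the mean and standard deviation of its
    top_n highest cohort (impostor) scores, and the two results are averaged.
    Assumes each selected cohort has non-zero variance.
    """
    def stats(cohort_scores):
        top = sorted(cohort_scores, reverse=True)[:top_n]
        return mean(top), pstdev(top)

    mu_e, sd_e = stats(enroll_cohort_scores)
    mu_t, sd_t = stats(test_cohort_scores)
    return 0.5 * ((raw_score - mu_e) / sd_e + (raw_score - mu_t) / sd_t)


# The margin lowers only the target logit, making the training task harder.
print(margin_logits([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], target=0))
print(as_norm(0.6, [0.1, 0.2, 0.3], [0.0, 0.2, 0.4], top_n=2))
```

The resulting logits would be passed to an ordinary softmax cross-entropy loss; because the margin penalizes only the correct class, the network is pushed to produce embeddings that remain correctly classified even with that handicap.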

5. Diarisation System Design

Diarisation systems follow a modular pipeline, typically:

Preprocessing: Neural or multi-stream VAD (e.g., pyannote.audio, ResNet-LSTM hybrids, entropy-fused streams) (Xiao et al., 2020, Tevissen et al., 2023).
Speaker Embedding: Sliding-window or uniform-segment embedding extractors, mostly ResNet or ECAPA backbones (Xiao et al., 2020, Thienpondt et al., 2021, Park et al., 2023).
Clustering: Agglomerative hierarchical clustering (AHC), spectral clustering (NMESC), Bayesian HMM/VBx, often with re-clustering and overlap adjustment (Wang et al., 2021, Thienpondt et al., 2021, Huh et al., 2023).
Overlap Handling: Separate overlap speech detection (OSD, TS-VAD), neural resegmentation, or blockwise CSS + leakage filtering (Xiao et al., 2020, Tevissen et al., 2023, Park et al., 2023).
System Fusion: Overlap-aware methods (DOVER, DOVER-Lap) fuse hypotheses from systems at different segment/hop granularities (Xiao et al., 2020, Park et al., 2023).
Performance: Top systems reached DERs of about 5% or lower in 2021–2023, with steady improvement as clustering, VAD, and overlap handling advanced (Brown et al., 2022, Huh et al., 2023).
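
The clustering stage of the pipeline above can be illustrated with a small agglomerative hierarchical clustering (AHC) routine over segment embeddings. This is a minimal sketch assuming average linkage on cosine distance and a fixed stopping threshold; the function names and the threshold value of 0.5 are hypothetical, and challenge systems typically tune the threshold on development data and follow AHC with re-clustering or resegmentation.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def ahc(embeddings, threshold=0.5):
    """Average-linkage AHC; returns one cluster label per segment embedding."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        dists = [cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters (p < q).
        d, p, q = min((linkage(clusters[p], clusters[q]), p, q)
                      for p in range(len(clusters))
                      for q in range(p + 1, len(clusters)))
        if d > threshold:
            break  # remaining clusters are treated as distinct speakers
        merged = clusters.pop(q)  # q > p, so index p is unaffected
        clusters[p].extend(merged)

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Four toy 2-D "embeddings": two segments per speaker.
segments = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(ahc(segments))
```

The number of speakers is not fixed in advance: it falls out of where the threshold stops the merging, which is why threshold selection has such a direct effect on speaker-confusion error.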

6. Year-over-Year Progress and Impact

Tracking performance on the fixed VoxSRC2019 test set, SOTA EERs dropped from 1.42% (2019) to 0.75% (2020) to 0.57% (2021); open track SOTA fell from 1.26% (2019) to 0.47% (2023). Diarisation DER advanced from 5.07% (2021) to 3.74% (2023) as system fusion and advanced VAD architectures became common (Huh et al., 2024, Huh et al., 2023). Harder test sets with cross-age, cross-lingual, and cross-microphone trials were progressively introduced, ensuring the challenge remained a driver of algorithmic and real-world improvements.
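
To put these figures in proportion, the relative reductions can be worked out directly from the numbers quoted above. The helper name below is an assumption for this example; the inputs are the percentages stated in the preceding paragraph.

```python
def relative_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


# Error rates (in %) as quoted in the text.
eer_2019_to_2021 = relative_reduction(1.42, 0.57)
open_2019_to_2023 = relative_reduction(1.26, 0.47)
der_2021_to_2023 = relative_reduction(5.07, 3.74)

print(f"EER 2019 -> 2021:        {eer_2019_to_2021:.1f}% relative reduction")
print(f"Open-track 2019 -> 2023: {open_2019_to_2023:.1f}% relative reduction")
print(f"DER 2021 -> 2023:        {der_2021_to_2023:.1f}% relative reduction")
```

This works out to roughly a 60% relative reduction in verification EER in each of the two comparisons and roughly a 26% relative reduction in DER.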

VoxSRC has been pivotal in establishing the dominance of deep embedding architectures with margin-based losses, self-supervision and pseudo-labeling for data-sparse domains, and score normalization/fusion techniques. It has also illuminated ongoing open problems: anti-spoofing, overlap-robust diarisation, data diversity and fairness, and the transition to transformer-based or end-to-end joint diarisation-recognition paradigms (Huh et al., 2024, Huh et al., 2023, Brown et al., 2022).

7. Open Challenges and Future Directions

Key research directions highlighted by VoxSRC and its participants include:

Anti-spoofing/Adversarial Robustness: Addressing synthetic and replay attacks.
Extreme Noise/Overlap: Robustness to highly overlapped, short, or far-field speech remains challenging, especially for end-to-end systems.
Fairness and Diversity: Data remains biased toward English and celebrity demographics; broadening coverage and demographic fairness is a priority (Huh et al., 2024).
Transformer and SSL Front-Ends: Extending the gains from self-supervised representations (e.g., HuBERT, WavLM, UniSpeech, XLS-R) to end-to-end models, including diarisation (Huh et al., 2023).
End-to-End Architectures: Joint optimization of VAD, speaker segmentation, and clustering (e.g., EEND, blockwise segmentation + clustering) (Huh et al., 2024).
Integration with ASR and Downstream Tasks: Joint diarisation and ASR, privacy-preserving adaptation, and explainability.

VoxSRC continues to serve as the principal forum for measuring and advancing robust, fair, and generalizable speaker recognition and diarisation systems in the research community.