VoxSRC Challenge: Speaker Recognition Benchmarks

Updated 5 March 2026
  • VoxSRC Challenge is an annual benchmark that standardizes data, metrics, and evaluation protocols for speaker recognition and diarisation under real-world conditions.
  • The challenge drives innovations in deep embedding architectures, margin-based loss functions, and self-supervised learning techniques to enhance system robustness.
  • Year-over-year progress shows significant reductions in error rates, highlighting improvements in feature extraction, score calibration, and system fusion strategies.

The VoxCeleb Speaker Recognition Challenge (VoxSRC) is an annual, benchmark-driven evaluation of automatic speaker recognition and diarisation systems operating under real-world, unconstrained conditions. Since its inception in 2019, VoxSRC has provided a unified public platform with standardized data, metrics, and evaluation protocols, catalyzing rapid advances in "in the wild" speaker verification and diarisation. The challenge encompasses various tracks, including closed/open training conditions, self-supervised and semi-supervised adaptation, and diarisation, with each track reflecting distinct methodological constraints and research objectives. VoxSRC has served as a principal driver of innovation in deep embedding architectures, loss functions, domain adaptation, and multi-stage training-adaptation paradigms in speaker recognition research (Huh et al., 2024).

1. Challenge Structure and Tracks

VoxSRC is composed of multiple tracks that typically include:

  • Speaker Verification (Closed condition): Training is restricted to the publicly released VoxCeleb2 development set, and external data is disallowed. This setting isolates purely algorithmic improvements (Nagrani et al., 2020, Huh et al., 2023).
  • Speaker Verification (Open condition): Training with any public or proprietary data, except the shared blind test set (Chung et al., 2019, Nagrani et al., 2020, Huh et al., 2023).
  • Self-Supervised and Semi-Supervised Tracks: Introduced in 2020–2022, these tracks prohibit the use of speaker labels or limit labels to a small, target domain subset; systems must employ self-supervised learning, pseudo-labeling, or domain adaptation (Nagrani et al., 2020, Huh et al., 2023).
  • Speaker Diarisation (Open): Any public or internal data (excluding the test set) may be used for training. The task focuses on segmenting and clustering multi-speaker audio (Nagrani et al., 2020, Brown et al., 2022, Huh et al., 2023).

A new test set is released annually for leaderboard evaluation, but a subset of the VoxSRC2019 test set is consistently re-used to enable longitudinal performance comparisons (Huh et al., 2024).

2. Task Definition and Evaluation Metrics

VoxSRC defines two core tasks: speaker verification (the binary same/different decision based on segment-pairs) and speaker diarisation ("who spoke when" labeling in multi-speaker audio).

Speaker Verification Metrics:

  • Equal Error Rate (EER): The error rate at the decision threshold where the false acceptance rate and the false rejection rate are equal.
  • Minimum Detection Cost Function (minDCF):

$$\mathrm{minDCF} = \min_\theta \left[ C_{\mathrm{miss}} P_{\mathrm{miss}}(\theta) P_{\mathrm{tar}} + C_{\mathrm{fa}} P_{\mathrm{fa}}(\theta) (1-P_{\mathrm{tar}}) \right]$$

with $C_{\mathrm{miss}} = C_{\mathrm{fa}} = 1$ and $P_{\mathrm{tar}} = 0.05$ in most challenge years (Chung et al., 2019, Nagrani et al., 2020, Brown et al., 2022, Huh et al., 2023).
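As a rough illustration of the two verification metrics, the sketch below sweeps every decision threshold over a toy trial list and evaluates the minDCF expression exactly as written above, without any further normalization. The function names and the toy scores are assumptions made for this example, not part of any official VoxSRC scoring tool. The EER is approximated by the operating point where the two error rates are closest, with no interpolation.

```python
def sweep_thresholds(scores, labels):
    """Return (p_miss, p_fa) pairs for every distinct decision threshold.

    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    Assumes at least one trial of each kind is present.
    """
    n_tar = sum(labels)
    n_non = len(labels) - n_tar
    trials = sorted(zip(scores, labels))
    points = [(0.0, 1.0)]  # threshold below every score: all trials accepted
    miss, fa = 0, n_non
    for i, (score, label) in enumerate(trials):
        # Moving the threshold just above this score rejects the trial.
        if label == 1:
            miss += 1
        else:
            fa -= 1
        # Record a point only once all trials tied at this score are handled.
        if i + 1 == len(trials) or trials[i + 1][0] != score:
            points.append((miss / n_tar, fa / n_non))
    return points


def eer(points):
    """Approximate EER: midpoint of the point where P_miss and P_fa are closest."""
    p_miss, p_fa = min(points, key=lambda p: abs(p[0] - p[1]))
    return (p_miss + p_fa) / 2.0


def min_dcf(points, c_miss=1.0, c_fa=1.0, p_tar=0.05):
    """Minimum over thresholds of the detection cost defined in the text."""
    return min(c_miss * pm * p_tar + c_fa * pf * (1.0 - p_tar)
               for pm, pf in points)


# Toy trial list (illustrative values only).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
toy_points = sweep_thresholds(toy_scores, toy_labels)
print("EER:", eer(toy_points), "minDCF:", min_dcf(toy_points))
```

Because $P_{\mathrm{tar}}$ is small, the cost weights false alarms far more heavily than misses, so the minDCF operating point usually sits at a stricter threshold than the EER point.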

Speaker Diarisation Metrics:

  • Diarisation Error Rate (DER): The sum of missed speech, false alarms, and speaker confusions, normalized by reference speech time, with a 0.25 s "forgiveness collar" (Nagrani et al., 2020):

$$\mathrm{DER} = \frac{T_\mathrm{miss} + T_\mathrm{fa} + T_\mathrm{spk}}{T_\mathrm{total}} \times 100\%$$

  • Jaccard Error Rate (JER): One minus the average Jaccard index over optimally mapped reference and predicted speakers (Huh et al., 2023):

$$\mathrm{JER} = 1 - \frac{1}{S} \sum_{s=1}^S \frac{|R_s\cap H_{\pi(s)}|}{|R_s\cup H_{\pi(s)}|}$$
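The two diarisation metrics can be sketched on toy data as follows. This is a simplified illustration under stated assumptions: speech is represented as sets of frame indices per speaker, the 0.25 s collar is not applied, and the optimal mapping $\pi$ is found by brute force over permutations, which is only practical for a handful of speakers (production scorers typically use an assignment algorithm instead). All function and variable names are hypothetical.

```python
from itertools import permutations


def der(t_miss, t_fa, t_spk, t_total):
    """Diarisation error rate in percent, from durations in seconds."""
    return 100.0 * (t_miss + t_fa + t_spk) / t_total


def jer(reference, hypothesis):
    """Jaccard error rate as a fraction in [0, 1].

    reference, hypothesis: dicts mapping a speaker label to the set of
    frame indices in which that speaker is active.
    """
    ref = list(reference.values())
    hyp = list(hypothesis.values())
    # Pad with empty sets so every reference speaker can be mapped to something.
    hyp += [set()] * max(0, len(ref) - len(hyp))
    best = 0.0
    # Try every one-to-one mapping of reference speakers to hypothesis speakers.
    for perm in permutations(hyp, len(ref)):
        total = sum(len(r & h) / len(r | h)
                    for r, h in zip(ref, perm) if r | h)
        best = max(best, total)
    return 1.0 - best / len(ref)


# Toy example: speaker B is recovered exactly, speaker A loses two frames.
toy_ref = {"A": set(range(0, 10)), "B": set(range(10, 20))}
toy_hyp = {"x": set(range(10, 20)), "y": set(range(0, 8))}
print("DER (%):", der(t_miss=2.0, t_fa=1.0, t_spk=3.0, t_total=100.0))
print("JER:", jer(toy_ref, toy_hyp))
```

Unlike DER, which is dominated by whoever speaks most, JER averages over speakers, so a badly diarised minor speaker affects the score as much as a dominant one.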

3. Benchmark Datasets

VoxSRC provides all data required for training, validation, and blind evaluation each year:

  • VoxCeleb1/2: The main training resource, extracted from YouTube interviews, covering >5,000 speakers and over one million utterances (Chung et al., 2019, Nagrani et al., 2020, Huh et al., 2023).
  • VoxMovies: Out-of-domain movie clips were introduced to test generalization beyond interviews starting in 2020 (Nagrani et al., 2020).
  • VoxConverse: Multi-speaker YouTube data for diarisation, with average 4–6 speakers per file and 3–30% overlap (Nagrani et al., 2020, Brown et al., 2022).
  • CN-Celeb: The domain adaptation tracks (2022–2023) introduced non-English (Mandarin) speech (Huh et al., 2023).
  • Validation and Test Sets: Segmented, trial-pair lists are published for validation and a hidden test set is used for final scoring, with increased trial complexity across years (e.g., "hard positives" from the same speaker over substantial covariate shift; "hard negatives" from different speakers in same-channel) (Huh et al., 2023).

4. System Architectures and Methodologies

A consensus state-of-the-art (SOTA) pipeline has emerged from challenge winners, generally involving:

  • Feature Extraction: 64/80/96-dimensional log-mel spectrograms or MFCCs; often with pre-emphasis, mean-variance normalization, and per-utterance normalization.
  • Embedding Backbone: Deep CNN or TDNN variants.
  • Loss Functions: Margin-based softmax objectives dominate:
    • AM-Softmax (CosFace): encourages cosine-margin separability between speakers.
    • AAM-Softmax (ArcFace): angular margin on class angle increases the intra/inter-class margin.
    • Composite losses: Mix of softmax/cross-entropy and metric learning (e.g., Angular Prototypical Loss, Inter-TopK penalties, Sub-center loss) (Kwon et al., 2020, Chen et al., 2022).
  • Training Recipes:
    • Extensive data augmentation: MUSAN noise/music/babble, RIR convolution, speed/pitch perturbation, sometimes SpecAugment (Chen et al., 2022).
    • Large batch sizes (200 or more), mixed-precision training for scaling, and cyclical learning rates or ReduceLROnPlateau scheduling.
    • Two- or three-stage curriculum: short-segment initial training, long-segment fine-tuning with larger margin (Zheng et al., 2023, Chen et al., 2022).
  • Scoring and Calibration:
    • Cosine similarity and/or PLDA back-ends, now often superseded by length/AS-norm (adaptive symmetric score normalization) with large impostor cohorts.
    • Quality-aware post-processing: QMF calibration, consistency factor (CMF) (Zheng et al., 2023, Huh et al., 2023).
    • Multi-system fusion via logistic regression fusers, with late normalization and calibration (Xiang, 2020).
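Two of the ingredients listed above can be made concrete with a small pure-Python sketch: the additive angular margin applied to the target-class logit (AAM-Softmax), and adaptive symmetric score normalization (AS-norm) of a cosine score against an impostor cohort. The margin, scale, and cohort size used here are illustrative assumptions rather than values taken from any particular challenge submission, and practical implementations add safeguards (for example, handling angles where the margin would push past π) that are omitted.

```python
import math
import statistics


def aam_softmax_logits(cosines, target, margin=0.2, scale=30.0):
    """Scaled logits with an additive angular margin on the target class.

    cosines: cosine similarity between one embedding and each class weight.
    """
    logits = []
    for k, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # guard acos against rounding error
        if k == target:
            # Penalise the target class: cos(theta + m) < cos(theta)
            c = math.cos(math.acos(c) + margin)
        logits.append(scale * c)
    return logits


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def as_norm(enroll, test, cohort, top_n=2):
    """Normalise a raw cosine score using each side's top-N cohort scores."""
    raw = cosine(enroll, test)

    def cohort_stats(x):
        top = sorted((cosine(x, c) for c in cohort), reverse=True)[:top_n]
        return statistics.mean(top), max(statistics.pstdev(top), 1e-8)

    mu_e, sd_e = cohort_stats(enroll)
    mu_t, sd_t = cohort_stats(test)
    return 0.5 * ((raw - mu_e) / sd_e + (raw - mu_t) / sd_t)


# Toy 2-D embeddings and a three-item cohort (illustrative values only).
toy_cohort = [[0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
print(aam_softmax_logits([0.5, 0.1], target=0))
print(as_norm([1.0, 0.0], [0.8, 0.6], toy_cohort))
```

The margin makes training harder for the correct class, which tightens same-speaker clusters, while AS-norm rescales each trial by how the two embeddings score against their closest impostors, which makes a single global threshold more meaningful.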

5. Diarisation System Design

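Diarisation systems follow a modular pipeline, typically combining voice activity detection (VAD), speaker segmentation, and clustering, with system fusion applied across multiple pipelines.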

6. Year-over-Year Progress and Impact

Tracking performance on the fixed VoxSRC2019 test set, SOTA EERs dropped from 1.42% (2019) to 0.75% (2020) to 0.57% (2021); open track SOTA fell from 1.26% (2019) to 0.47% (2023). Diarisation DER advanced from 5.07% (2021) to 3.74% (2023) as system fusion and advanced VAD architectures became common (Huh et al., 2024, Huh et al., 2023). Harder test sets with cross-age, cross-lingual, and cross-microphone trials were progressively introduced, ensuring the challenge remained a driver of algorithmic and real-world improvements.
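To put these figures on a common footing, the relative error reduction between two reported rates can be computed as below. The helper name is hypothetical, and the inputs are simply the numbers quoted in the paragraph above.

```python
def relative_reduction(before, after):
    """Relative reduction in percent between two error rates."""
    return 100.0 * (before - after) / before


# Figures quoted in the text: (label, earlier rate %, later rate %)
reported = [
    ("EER 2019 -> 2021", 1.42, 0.57),
    ("Open-track EER 2019 -> 2023", 1.26, 0.47),
    ("DER 2021 -> 2023", 5.07, 3.74),
]
for label, before, after in reported:
    print(f"{label}: {relative_reduction(before, after):.1f}% relative reduction")
```

Expressed this way, the verification error rates fell by roughly 60% in relative terms, while the diarisation error rate fell by roughly a quarter over a shorter period.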

VoxSRC has been pivotal in establishing the dominance of deep embedding architectures with margin-based losses, self-supervision and pseudo-labeling for data-sparse domains, and score normalization/fusion techniques. It has also illuminated ongoing open problems: anti-spoofing, overlap-robust diarisation, data diversity and fairness, and the transition to transformer-based or end-to-end joint diarisation-recognition paradigms (Huh et al., 2024, Huh et al., 2023, Brown et al., 2022).

7. Open Challenges and Future Directions

Key research directions highlighted by VoxSRC and its participants include:

  • Antispoofing/Adversarial Robustness: Addressing synthetic and replay attacks.
  • Extreme Noise/Overlap: Robustness to highly overlapped, short, or far-field speech remains challenging, especially for end-to-end systems.
  • Fairness and Diversity: Data remains biased toward English and celebrity demographics; broadening coverage and demographic fairness is a priority (Huh et al., 2024).
  • Transformer and SSL Front-Ends: Extending the gains from self-supervised representations (e.g., HuBERT, WavLM, UniSpeech, XLS-R) to end-to-end models, including diarisation (Huh et al., 2023).
  • End-to-End Architectures: Joint optimization of VAD, speaker segmentation, and clustering (e.g., EEND, blockwise segmentation + clustering) (Huh et al., 2024).
  • Integration with ASR and Downstream Tasks: Joint diarisation and ASR, privacy-preserving adaptation, and explainability.

VoxSRC continues to serve as the principal forum for measuring and advancing robust, fair, and generalizable speaker recognition and diarisation systems in the research community.
