
VoxCeleb Dataset for Speaker Recognition

Updated 26 September 2025
  • VoxCeleb is a large-scale audio-visual dataset designed for robust speaker recognition tasks in unconstrained, 'in the wild' scenarios.
  • It employs a fully automated curation pipeline leveraging computer vision and deep learning to extract diverse speaker utterances from YouTube videos.
  • The dataset underpins various applications, including speaker verification, diarisation, paralinguistics, and deepfake detection, with established benchmarks and evaluation protocols.

VoxCeleb is a large-scale, publicly available, audio-visual dataset for speaker identification, verification, and diarisation, curated to provide 'in the wild' scenarios that include natural variability in background noise, acoustic conditions, and speaker demographics. It was developed through a fully automated pipeline relying on computer vision and deep learning, enabling the collection of hundreds of thousands to millions of utterances from thousands of celebrities appearing in public YouTube videos. The later VoxCeleb2 dataset expanded both the number of speakers and the volume of real-world audio, supporting a wide range of research in robust speaker recognition under unconstrained conditions.

1. Data Acquisition and Automated Curation Pipeline

The VoxCeleb data collection pipeline is fully automated and leverages both audio and visual modalities to ensure that only segments in which the target speaker is actually speaking are retained. The workflow can be abstracted as follows:

  • Person of Interest (POI) Selection: Candidate identities are extracted from established celebrity face datasets (e.g., VGG Face, VGGFace2). VoxCeleb1 targeted 2,622 initial POIs, while VoxCeleb2 expanded coverage to over 6,000.
  • Video Downloading: For each POI, the system automatically downloads up to 100 top-ranked YouTube videos using queries of the pattern “[POI name] interview”, maximizing instances where the subject speaks.
  • Face Tracking and Detection: Video frames are processed using HOG-based or SSD-based face detectors, with subsequent face tracking (position-based or region-of-interest overlap trackers) to link detections into tracks.
  • Shot Boundary Detection: Color histogram comparison between consecutive frames is used to detect shot changes, aiding in segment organization (a minimal sketch follows this list).
  • Active Speaker Verification: SyncNet, a two-stream synchronization CNN, assesses audio-video correspondence by correlating lip movement with speech, filtering out dubbed or misaligned content.
  • Facial Verification: A deep CNN (VGG-16/ResNet-50 pretrained on face datasets) verifies that active speaker faces match the POI.
  • Duplicate Removal and Metadata Enrichment: Near-duplicate utterances are pruned with embedding similarity thresholds. Nationality, gender, and (for some enrichments) age and height metadata are scraped from knowledge bases, with additional demographic balancing applied.
  • Dataset Statistics: VoxCeleb1 contains over 100,000 utterances for 1,251 speakers. VoxCeleb2 scales to over 1 million utterances from 6,112 speakers, with 145+ nationalities, balanced gender, wide age ranges, and extensive acoustic/linguistic diversity.
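
One step of this pipeline, shot boundary detection via color histogram comparison, is concrete enough to sketch directly. The snippet below is a minimal illustration, not the released VoxCeleb tooling: the 8-bins-per-channel histograms, the chi-square distance, and the 0.5 cut threshold are illustrative assumptions, since the papers describe the technique but not these exact settings.

```python
import numpy as np

def shot_boundaries(frames, threshold=0.5):
    """Detect shot changes by comparing color histograms of consecutive frames.

    frames: iterable of HxWx3 uint8 RGB arrays.
    threshold: illustrative chi-square distance above which a cut is declared.
    """
    def color_histogram(frame):
        # 8 bins per RGB channel (512 bins total), normalized to sum to 1
        hist, _ = np.histogramdd(frame.reshape(-1, 3),
                                 bins=(8, 8, 8), range=((0, 256),) * 3)
        return hist.ravel() / hist.sum()

    cuts, prev = [], None
    for i, frame in enumerate(frames):
        cur = color_histogram(frame)
        if prev is not None:
            # Chi-square distance between consecutive normalized histograms
            dist = 0.5 * np.sum((cur - prev) ** 2 / (cur + prev + 1e-10))
            if dist > threshold:
                cuts.append(i)  # frame i starts a new shot
        prev = cur
    return cuts
```

In the full pipeline, such shot boundaries delimit the face tracks that the SyncNet and face-verification stages then filter.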

2. Dataset Structure, Variants, and Derived Resources

VoxCeleb is structured as a set of audio segment files with unique speaker identities and metadata that includes demographic and session labels. Both datasets are characterized by:

  • In the Wild Variability: Recordings span interviews, live events, red carpets, outdoor environments, studio sessions, and handheld devices, with substantial background noise, real conversational variability, multiple accents, and channel effects.
  • Session/Video Information: Each utterance is indexed by its source video, which enables parsing of environmental factors and supports environment-adversarial training (Chung et al., 2019).
  • Extension Datasets: VoxCeleb has inspired or directly enabled several notable derivative resources:
    • Age, Gender, and Height Enrichments: Metadata augmentation for paralinguistic research (Hechmi et al., 2021, Kacprzak et al., 16 Oct 2024).
    • Language/Region-specific Subsets: VoxCeleb-ESP (Spanish), EACeleb (East Asian), and others proposing new evaluation points for cross-lingual generalization (Caulley et al., 2022, Labrador et al., 2023).
    • Privacy-friendly Variants: SynVox2, a synthetic, anonymized release using content/f0 disentanglement and neural vocoding to address privacy and fairness (Miao et al., 2023).
    • Expressive Speech and Nonverbal Vocalization: NonverbalTTS includes nonverbal vocalizations and emotion annotation derived from the base data (Borisov et al., 17 Jul 2025).

A summary table of key datasets is below:

Dataset        Speakers   Utterances    Notable Characteristics
VoxCeleb1      1,251      ~100,000      Balanced gender, >36 nationalities, unconstrained
VoxCeleb2      6,112      >1,000,000    145+ nationalities, 29% US / 71% non-US
EACeleb        800+       ~             East Asian focus, fast face-tracking pipeline
VoxCeleb-ESP   160        2,400         Spanish speakers, regional/age diversity
SynVox2        6,112      ~1,000,000    Synthetic, privacy-preserving data

3. Baseline Systems, Models, and Training Protocols

Baseline systems for speaker identification and verification on VoxCeleb have evolved from classical statistical models to deep neural architectures:

  • Statistical Baselines: GMM-UBM and i-vector + PLDA systems, using MFCCs or FBanks, served as initial reference points.
  • CNN-based Architectures: The original baseline used a VGG-M-derived CNN with temporal pooling (average or log-sum-exp) to accommodate variable-length spectrogram inputs. Spectrograms are computed with a 25 ms Hamming window and a 10 ms step; features are mean- and variance-normalized per frequency bin.
  • Architectural Innovations:
    • The CNNs replace the first standard fully-connected layer (e.g., fc6) with a layer spanning the full frequency axis, followed by temporal average pooling, reducing parameter count and mitigating overfitting.
    • Training employs random 3-second crops for augmentation. Hard negative mining is used in Siamese or contrastive setups to enhance embedding discrimination.
  • Training Losses and Metrics:
    • Identification uses multi-class cross-entropy loss; verification leverages Siamese/contrastive losses (e.g., a contrastive loss with margin α).
    • Detection Cost Function (DCF) and Equal Error Rate (EER) are the core metrics (a reference implementation is sketched at the end of this section):

    C_{\textrm{det}} = C_{\textrm{miss}} P_{\textrm{miss}} P_{\textrm{tar}} + C_{\textrm{fa}} P_{\textrm{fa}} (1 - P_{\textrm{tar}})

    with P_{\textrm{tar}} = 0.01 and C_{\textrm{miss}} = C_{\textrm{fa}} = 1.0 (VoxCeleb1 baseline).

  • State-of-the-art supervised models (e.g., ResNet34/50, ECAPA-TDNN, Res2Net) and self-supervised models (wav2vec, WavLM) are routinely benchmarked in later VoxSRC challenge tracks (Liu et al., 2022, Torgashov et al., 2023).

  • Results and Performance Benchmarks:

    • VoxCeleb1: Top-1 identification accuracy of 80.5% and top-5 of 92.1% (CNN); verification EER as low as 7.8% (Siamese CNN), outperforming classical systems.
    • VoxCeleb2: A deep ResNet-50 with test-time augmentation achieves 3.95% EER (Chung et al., 2018). VoxSRC challenge systems reach EER <1.5% and minDCF ≪ 0.1 on evaluation sets, aided by advanced data augmentation and score fusion/calibration (Liu et al., 2022, Torgashov et al., 2023).
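
As a concrete reference for these metrics, the sketch below computes EER and the minimum of the C_det expression above from raw trial scores, with the VoxCeleb1 baseline cost parameters as defaults. It is a minimal, unnormalized implementation using a simple threshold sweep, not an official scoring tool.

```python
import numpy as np

def eer_and_min_dcf(target_scores, nontarget_scores,
                    p_tar=0.01, c_miss=1.0, c_fa=1.0):
    """Compute EER and minimum detection cost from verification trial scores."""
    target_scores = np.asarray(target_scores)
    nontarget_scores = np.asarray(nontarget_scores)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))

    # P_miss: target trials scored below the threshold (rejected in error);
    # P_fa: non-target trials scored at or above it (accepted in error).
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])

    # EER: operating point where the two error rates (nearly) coincide
    idx = np.argmin(np.abs(p_miss - p_fa))
    eer = (p_miss[idx] + p_fa[idx]) / 2

    # C_det swept over thresholds, then minimized (unnormalized minDCF)
    dcf = c_miss * p_miss * p_tar + c_fa * p_fa * (1 - p_tar)
    return eer, dcf.min()

# Synthetic sanity check: same-speaker trials score higher on average.
rng = np.random.default_rng(0)
eer, min_dcf = eer_and_min_dcf(rng.normal(1.0, 0.5, 1000),
                               rng.normal(-1.0, 0.5, 1000))
print(f"EER: {eer:.2%}  minDCF: {min_dcf:.4f}")
```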

4. Applications, Evaluation Protocols, and Downstream Benchmarks

The VoxCeleb datasets serve as a common benchmark for a broad spectrum of research and operational evaluation:

  • Speaker Verification: Standardized pairwise trial protocols (are x_1 and x_2 the same speaker?), scored with EER and minDCF. Relevant to device unlocking, security, and forensics.
  • Speaker Identification: Multi-class task (classify which of N enrolled identities a segment belongs to), supporting applications in surveillance and indexing.
  • Speaker Diarisation: "Who spoke when?" segmentation on multi-speaker, real-world audio, measured by Diarisation Error Rate (DER) and Jaccard Error Rate (JER) (Ghahabi et al., 2020, Kim et al., 2021); DER is defined after this list.
  • Open-Set Identification: VoxWatch establishes a public benchmark for open-set speaker identification, detecting a speaker's presence or absence on a large "watchlist", emphasizing the false-alarm problem as watchlist size grows and the importance of score calibration/fusion (Peri et al., 2023).
  • Paralinguistic Tasks: Extensions for age, gender, and height estimation utilize metadata-enriched VoxCeleb as a training and evaluation standard (Hechmi et al., 2021, Kacprzak et al., 16 Oct 2024). NonverbalTTS leverages VoxCeleb-derived audio for nonverbal vocalization and emotional TTS benchmarks (Borisov et al., 17 Jul 2025).
  • Speech Deepfake/Anti-Spoofing: SpoofCeleb applies VoxCeleb1 post-processed by rigorous automated segmentation and enhancement, facilitating robust training/testing of deepfake detection and spoof-resistant ASV models under in-the-wild conditions (Jung et al., 18 Sep 2024).
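
For reference, the DER reported in these diarisation evaluations follows the conventional definition (standard across benchmarks, not specific to VoxCeleb):

    DER = (T_{\textrm{FA}} + T_{\textrm{miss}} + T_{\textrm{conf}}) / T_{\textrm{total}}

where T_{\textrm{FA}} is speech time falsely attributed to a speaker, T_{\textrm{miss}} is missed speech, T_{\textrm{conf}} is speech assigned to the wrong speaker, and T_{\textrm{total}} is the total scored speech time.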

Evaluation on VoxCeleb-based challenge sets follows strict data partitioning to prevent training on test identities, with protocols for both “fixed” (VoxCeleb-only training) and “open” (allowing extra data) conditions (Chung et al., 2019, Huh et al., 27 Aug 2024).
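
A minimal sketch of trial scoring under this protocol is shown below. It assumes the public VoxCeleb1 trial-list format (one `label wav1 wav2` triple per line, with label 1 for same-speaker trials) and takes an arbitrary `embed` callable as a stand-in for a trained speaker-embedding extractor; both are assumptions for illustration, not a specific released toolkit.

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_trials(trial_path, embed):
    """Score a VoxCeleb1-style trial list with cosine similarity.

    trial_path: text file with lines 'label wav1 wav2' (label in {0, 1}).
    embed: callable mapping a wav path to a fixed-dimensional embedding.
    """
    labels, scores = [], []
    with open(trial_path) as f:
        for line in f:
            label, wav1, wav2 = line.split()
            labels.append(int(label))
            scores.append(cosine_score(embed(wav1), embed(wav2)))
    return np.array(labels), np.array(scores)
```

The resulting scores, split into target (label 1) and non-target (label 0) subsets, can be passed directly to the EER/minDCF routine sketched in Section 3.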

5. Limitations, Privacy, Fairness, and Future Directions

While VoxCeleb and its derivatives are the de facto standard for speaker recognition, several limitations have been identified:

  • Privacy/Ethics: Collection from public celebrity media without explicit consent has raised concerns. The restricted availability of VoxCeleb2 and the development of privacy-friendly synthetic variants (e.g., SynVox2, using OHNN anonymization and HiFi-GAN vocoding) aim to mitigate identity risk but introduce utility/fairness trade-offs (EER for cross-domain verification increases from ~1.3% to ~7%) (Miao et al., 2023).
  • Language and Demographic Coverage: Despite expansion, under-representation of low-resource languages and of accent/dialect variation limits utility for global deployment. Tailored subsets (e.g., EACeleb, VoxCeleb-ESP) and multilingual challenge test sets address these gaps incrementally (Caulley et al., 2022, Labrador et al., 2023, Huh et al., 27 Aug 2024).
  • Annotation Noise and Non-Speaker Segments: Even with advanced pipelines, label noise persists; video-free weak supervision and diarization-based methods are proposed to leverage more data and mitigate reliance on facial identity (Barahona et al., 3 Oct 2024).
  • Overfitting, Short-duration, and Overlapped Speech: Achieving robust performance with minimal data or under heavy overlap remains challenging, spurring research into self-supervised and sequence-to-sequence diarisation models.
  • Anti-spoofing and Domain Adaptation: With advances in voice conversion/deepfakes, resilience against AI-generated speech is now an active research domain. SpoofCeleb and VoxSRC domain adaptation tracks focus on these issues.
  • Evaluation and Reproducibility: Persistent test sets supporting year-over-year comparison, open release of protocols (trial lists, scored pairs), and community-maintained leaderboards have become best practices.

6. Impact and Community Adoption

VoxCeleb’s introduction precipitated a paradigm shift in speaker recognition research, enabling rapid advances in deep learning methods, the design of robust neural speaker embedding extractors, and the acceleration of benchmark challenges such as VoxSRC (2019–2023), which have driven error rates to near saturation even under unconstrained acoustic conditions (Huh et al., 27 Aug 2024). The dataset’s incorporation into derivative resources for synthesis, anti-spoofing, paralinguistics, and downstream multimodal analytics (e.g., emotion, nonverbal content, and biometric profiling) continues to support a wide array of research agendas.

Due to its breadth, complexity, and open evaluation ecosystem, VoxCeleb is now foundational to the study and advancement of automatic speaker recognition, verification, diarisation, and related fields.
