VoxCeleb2 Dataset Overview
- VoxCeleb2 is a large-scale, unconstrained audio-visual dataset featuring over one million utterances from 6,000+ speakers drawn from diverse real-world scenarios.
- It employs a fully automated pipeline with face detection, active speaker verification, and metadata augmentation to ensure accurate speaker annotation.
- Baseline models trained on VoxCeleb2 reduce the equal error rate on the VoxCeleb1 test set from 7.80% to as low as 3.95%, supporting its value for training deep speaker embeddings.
The VoxCeleb2 dataset is a large-scale audio-visual corpus designed for robust, unconstrained speaker recognition. Curated by Chung et al. (2018), it surpasses earlier speaker verification corpora in both scale and diversity, offering over one million utterances from more than 6,000 speakers. The dataset underpins advances in deep speaker embedding architectures and training methodologies, enabling substantial reductions in error rates for automatic speaker verification under real-world, noisy conditions.
1. Dataset Composition and Collection Pipeline
VoxCeleb2 comprises 6,112 identities (61% male) and 1,128,246 utterances, totaling approximately 2,442 hours of speech drawn from 150,480 unique YouTube videos. The corpus spans 145 nationalities, with recordings sourced from a diverse array of scenarios such as TV interviews, press conferences, speeches, sports events, and music videos. The data is characterized by real-world acoustic and visual variability, including background chatter, overlapping speech, laughter, music, channel artifacts, and visual occlusions (Chung et al., 2018).
The acquisition pipeline is fully automated:
- POI Selection: Start from the VGGFace2 identity list of over 9,000 person-of-interest (POI) names, and remove identities that overlap with VoxCeleb1 and SITW from the development set.
- Video Retrieval: Download up to 100 top-ranked “<name> interview” YouTube clips per POI.
- Face Detection and Tracking: Apply an SSD-based (Single Shot MultiBox Detector) face detector that handles multiple poses, then link detections into face tracks by ROI overlap.
- Face Verification: Employ ResNet-50 (VGGFace2-pretrained) to confirm face identities.
- Active Speaker Verification: Use multi-view SyncNet to exploit audio-mouth motion correlation and reject dubbed or voice-over content.
- Duplicate Removal: Compute 1024-dimensional CNN embeddings and drop duplicates (pairs with Euclidean distance < 0.1).
- Label Augmentation: Crawl nationality metadata from Wikipedia.
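The duplicate-removal step can be illustrated with a minimal sketch. It assumes the per-segment embeddings for one speaker are already available as a NumPy array; the function name `drop_near_duplicates` and the greedy keep-first strategy are illustrative assumptions, not the authors' implementation. Only the 0.1 Euclidean threshold comes from the description above.

```python
import numpy as np

def drop_near_duplicates(embeddings, threshold=0.1):
    """Return indices of the segments to keep for a single speaker.

    embeddings: array of shape (n_segments, dim), one row per speech segment.
    A segment is dropped when its Euclidean distance to an already-kept
    segment is below `threshold` (greedy, keep-first; an assumption).
    """
    embeddings = np.asarray(embeddings, dtype=np.float64)
    kept = []
    for i, vec in enumerate(embeddings):
        if kept:
            dists = np.linalg.norm(embeddings[kept] - vec, axis=1)
            if np.any(dists < threshold):
                continue  # near-identical to a kept segment: treat as duplicate
        kept.append(i)
    return kept
```

The comparison is restricted to segments of the same speaker, which keeps the pairwise search small relative to a corpus-wide scan.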
Each utterance is accompanied by metadata: YouTube video ID, speaker ID, gender, nationality tag, utterance start/end timestamps, and face-track boundaries with face bounding boxes.
2. Organization, Preprocessing, and Access
The dataset is partitioned as follows:
- Development Set: 5,994 speakers, 145,569 videos, and 1,092,009 utterances.
- Test Set: 118 speakers, 4,911 videos, and 36,237 utterances.
The development set shares no identities with VoxCeleb1 or SITW (Chung et al., 2018), so models trained on it can be evaluated on those corpora without speaker overlap.
The recommended audio format is lossless mono WAV sampled at 16 kHz. Video frames for each segment are also provided so that audio can be associated with face tracks. The baseline preprocessing converts audio to 512-bin magnitude spectrograms using a 25 ms Hamming window with a 10 ms hop (512-point FFT), applies mean and variance normalization to each frequency bin, and optionally applies voice activity detection to trim silence. For deep learning pipelines, 3 s spectrogram crops are randomly sampled during training.
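The preprocessing described above can be sketched in NumPy. This is an illustration under stated assumptions, not the reference implementation: all 512 bins of the two-sided FFT are kept (one way to obtain 512 bins from a 512-point FFT), no padding is applied at the signal edges, and the function names are hypothetical. With this framing a 3 s crop yields 298 frames; the exact frame count depends on how edges are padded.

```python
import numpy as np

SAMPLE_RATE = 16_000
WIN = SAMPLE_RATE * 25 // 1000   # 25 ms window -> 400 samples
HOP = SAMPLE_RATE * 10 // 1000   # 10 ms hop    -> 160 samples
N_FFT = 512

def random_crop(wave, seconds=3.0, rng=None):
    """Pick a random fixed-length crop from a 1-D waveform."""
    if rng is None:
        rng = np.random.default_rng()
    n = int(seconds * SAMPLE_RATE)
    if len(wave) <= n:
        return wave
    start = int(rng.integers(0, len(wave) - n + 1))
    return wave[start:start + n]

def magnitude_spectrogram(wave):
    """Return a (N_FFT, n_frames) magnitude spectrogram."""
    wave = np.asarray(wave, dtype=np.float64)
    n_frames = 1 + (len(wave) - WIN) // HOP
    # Index matrix: one row of WIN sample positions per frame.
    idx = np.arange(WIN)[None, :] + HOP * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hamming(WIN)
    # Assumption: keep all N_FFT bins of the zero-padded, two-sided FFT.
    return np.abs(np.fft.fft(frames, n=N_FFT, axis=1)).T

def normalize_per_bin(spec, eps=1e-8):
    """Mean and variance normalization of every frequency bin over time."""
    mean = spec.mean(axis=1, keepdims=True)
    std = spec.std(axis=1, keepdims=True)
    return (spec - mean) / (std + eps)
```

A typical call chain would be `normalize_per_bin(magnitude_spectrogram(random_crop(wave)))`, applied to a fresh random crop at each training step.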
3. Baseline Architectures and Training Methodologies
Canonical supervised pipelines utilize 3 s (512×300) spectrograms as input. Tested “trunk” architectures include a VGG-M-inspired CNN and deeper ResNet-34 and ResNet-50 models. Key elements include:
- For VGG-M, “fc6” is modified to a 9×1 convolution, followed by 1×n average pooling, granting temporal invariance.
- Residual networks apply standard ResNet blocks to the spectrogram.
Losses employed include:
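- Softmax Cross-Entropy (identification pre-training), in its standard form:
  $$L_{\text{softmax}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{W_{c_i}^{\top} x_i + b_{c_i}}}{\sum_{j=1}^{C} e^{W_j^{\top} x_i + b_j}}$$
- Additive Margin Softmax (AM-Softmax) loss, in its standard form:
  $$L_{\text{AM}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{s(\cos\theta_{c_i,i} - m)}}{e^{s(\cos\theta_{c_i,i} - m)} + \sum_{j \neq c_i} e^{s\cos\theta_{j,i}}}$$
- Contrastive Loss (for embedding fine-tuning), in its standard form:
  $$L_{\text{con}} = \frac{1}{2N}\sum_{i=1}^{N} \left[ y_i d_i^2 + (1 - y_i)\max(0, \alpha - d_i)^2 \right]$$
Here $x_i$ is the trunk output for sample $i$, $c_i$ its speaker label among $C$ training speakers, $\theta_{j,i}$ the angle between the normalized class weight $W_j$ and $x_i$, $s$ a scale factor, and $m$ the additive margin. In the contrastive loss, $y_i = 1$ for same-speaker pairs and $y_i = 0$ otherwise, $d_i$ is the Euclidean distance between the two embeddings of pair $i$, and $\alpha$ is a margin.
Training follows a two-stage regimen: first pre-train the trunk with a softmax layer for multi-way speaker classification, then replace the classification head with a 512-dimensional embedding layer and fine-tune the full network with contrastive loss. Hard negative mining selects the top 1% hardest impostors per epoch.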
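As an illustration of the fine-tuning stage, the sketch below computes a contrastive loss in its standard textbook form and selects the hardest impostor pairs by distance. The function names, the default margin, and the interpretation of "hardest" as "smallest embedding distance" are assumptions made for this example rather than details confirmed by the source.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same_speaker, margin=1.0):
    """Standard contrastive loss over N embedding pairs.

    same_speaker: 1 for genuine pairs, 0 for impostor pairs.
    Genuine pairs are pulled together; impostor pairs are pushed
    apart until they are at least `margin` away.
    """
    emb_a = np.asarray(emb_a, dtype=np.float64)
    emb_b = np.asarray(emb_b, dtype=np.float64)
    y = np.asarray(same_speaker, dtype=np.float64)
    d = np.linalg.norm(emb_a - emb_b, axis=1)
    per_pair = y * d ** 2 + (1.0 - y) * np.maximum(0.0, margin - d) ** 2
    return 0.5 * per_pair.mean()

def hardest_negatives(distances, fraction=0.01):
    """Indices of the impostor pairs with the smallest distances.

    Assumption: an impostor pair is 'hard' when its embeddings are close.
    """
    distances = np.asarray(distances, dtype=np.float64)
    k = max(1, int(np.ceil(fraction * len(distances))))
    return np.argsort(distances)[:k]
```

In a training loop, `hardest_negatives` would be applied to the distances of candidate impostor pairs, and only the returned pairs would contribute negative examples to `contrastive_loss`.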
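Three test-time strategies are compared: (1) a baseline that applies global average pooling over the full utterance; (2) sampling ten 3 s crops per utterance and averaging their features; and (3) sampling ten 3 s crops per utterance and averaging the distances over all crop pairs between the two utterances being compared.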
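The two crop-based scoring strategies differ only in where the averaging happens. The following sketch assumes each utterance is already represented by an array of crop embeddings of shape (n_crops, dim); the function names are hypothetical.

```python
import numpy as np

def mean_embedding_distance(crops_a, crops_b):
    """Average the crop embeddings of each utterance, then compare once."""
    crops_a = np.asarray(crops_a, dtype=np.float64)
    crops_b = np.asarray(crops_b, dtype=np.float64)
    return float(np.linalg.norm(crops_a.mean(axis=0) - crops_b.mean(axis=0)))

def mean_pairwise_distance(crops_a, crops_b):
    """Compare every crop of A with every crop of B, then average."""
    crops_a = np.asarray(crops_a, dtype=np.float64)
    crops_b = np.asarray(crops_b, dtype=np.float64)
    diffs = crops_a[:, None, :] - crops_b[None, :, :]   # (n_a, n_b, dim)
    return float(np.linalg.norm(diffs, axis=2).mean())
```

With ten crops per utterance, the pairwise variant averages 100 distances. By the triangle inequality its value is never smaller than the distance between the mean embeddings, so the two scores are not interchangeable and thresholds need to be calibrated separately.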
4. Benchmarking, Evaluation Protocols, and Results
Performance is predominantly measured using Equal Error Rate (EER) and minimum Detection Cost Function (minDCF), computed with $P_{\text{tar}} = 0.01$ and $C_{\text{miss}} = C_{\text{fa}} = 1$.
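Both metrics can be computed from a list of trial scores and target/non-target labels. The sketch below is a simple threshold sweep, assuming higher scores mean "more likely the same speaker" (distances would need to be negated first) and ignoring tied scores. The default parameters (`p_target=0.01`, unit costs) and the normalization of the cost by the best trivial system are assumptions of this example.

```python
import numpy as np

def error_rates(scores, labels):
    """Return (false-accept rates, miss rates) over all thresholds."""
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels, dtype=bool)
    labels = labels[np.argsort(-scores)]          # sort by descending score
    # Accept the k highest-scoring trials, for k = 0 .. N.
    accepted_targets = np.concatenate([[0], np.cumsum(labels)])
    accepted_nontargets = np.concatenate([[0], np.cumsum(~labels)])
    p_miss = 1.0 - accepted_targets / labels.sum()
    p_fa = accepted_nontargets / (~labels).sum()
    return p_fa, p_miss

def equal_error_rate(scores, labels):
    """Operating point where miss and false-accept rates are closest."""
    p_fa, p_miss = error_rates(scores, labels)
    i = np.argmin(np.abs(p_fa - p_miss))
    return 0.5 * (p_fa[i] + p_miss[i])

def min_dcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum detection cost, normalized by the best trivial system."""
    p_fa, p_miss = error_rates(scores, labels)
    cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    return cost.min() / min(c_miss * p_target, c_fa * (1.0 - p_target))
```

For production use, an interpolated EER on the ROC convex hull is more precise than the nearest-point estimate used here.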
Major benchmarks include the original VoxCeleb1 test set, and the newly introduced VoxCeleb1-E (1,251 speakers, 581,480 pairs) and VoxCeleb1-H (same-gender/nationality, 552,536 pairs). Key results (Chung et al., 2018):
| Model | Training Data | minDCF | EER |
|---|---|---|---|
| i-vectors+PLDA | VoxCeleb1 | 0.73 | 8.80% |
| VGG-M (contrastive) | VoxCeleb1 | 0.71 | 7.80% |
| VGG-M (contrastive) | VoxCeleb2 | 0.609 | 5.94% |
| ResNet-34 (aug3) | VoxCeleb2 | 0.549 | 4.83% |
| ResNet-50 (aug3) | VoxCeleb2 | 0.429 | 3.95% |
For the VoxCeleb1-E and VoxCeleb1-H protocols:
| Model | Benchmark | minDCF | EER |
|---|---|---|---|
| ResNet-50 (aug3) | VoxCeleb1-E | 0.524 | 4.42% |
| ResNet-50 (aug3) | VoxCeleb1-H | 0.673 | 7.33% |
Deeper residual architectures and test-time augmentation offer notable, though diminishing, gains.
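The size of these gains can be made concrete with a short calculation on the EER values reported for the original VoxCeleb1 test set. The helper name `relative_reduction` is introduced here for illustration only.

```python
# EER (%) on the original VoxCeleb1 test set, copied from the table above.
eer = {
    "VGG-M, VoxCeleb1": 7.80,
    "VGG-M, VoxCeleb2": 5.94,
    "ResNet-34, VoxCeleb2": 4.83,
    "ResNet-50, VoxCeleb2": 3.95,
}

def relative_reduction(before, after):
    """Fraction by which the error rate falls when going from before to after."""
    return (before - after) / before

baseline = eer["VGG-M, VoxCeleb1"]
for name, value in eer.items():
    print(f"{name}: {100 * relative_reduction(baseline, value):.1f}% lower EER")
```

Under these figures, changing only the training data (VGG-M on VoxCeleb2) accounts for roughly a 24% relative reduction, while the ResNet-50 trunk with test-time augmentation reaches roughly 49%.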
5. Low-Resource Subsets and Session Variability
Training with limited data motivates systematic subsampling of VoxCeleb2 (Vaessen et al., 2022). Three ~50,000-utterance subsets with distinct speaker and session distributions were defined for benchmarking under resource constraints:
| Subset | Speakers | Sessions | Utterances | Sessions/speaker (μ) | Utterances/session (μ) |
|---|---|---|---|---|---|
| vox2 (full dev) | 5,994 | 136,632 | 1,068,871 | 22.8 | 7.8 |
| tiny-few-speakers | 100 | 5,066 | 49,400 | 50.7 | 9.8 |
| tiny-few-sessions | 5,994 | 6,275 | 47,952 | 1.0 | 7.6 |
| tiny-many-sessions | 5,994 | 46,813 | 47,952 | 7.8 | 1.0 |
These partitions maintain the original utterance length distribution (mean 7.8 s, SD 5.2 s). The construction protocol balances gender and maximizes either intra-speaker variability or speaker diversity. Evaluation on VoxCeleb1 demonstrates that both the number of speakers and session diversity significantly affect downstream speaker recognition performance, especially under constrained training scenarios.
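The per-speaker and per-session means in the table follow directly from the raw counts, which the short check below reproduces. The dictionary simply restates the table; the helper name `ratios` is hypothetical.

```python
# (speakers, sessions, utterances) copied from the table above.
subsets = {
    "vox2 (full dev)":    (5994, 136632, 1068871),
    "tiny-few-speakers":  (100, 5066, 49400),
    "tiny-few-sessions":  (5994, 6275, 47952),
    "tiny-many-sessions": (5994, 46813, 47952),
}

def ratios(speakers, sessions, utterances):
    """Mean sessions per speaker and mean utterances per session."""
    return round(sessions / speakers, 1), round(utterances / sessions, 1)

for name, counts in subsets.items():
    per_speaker, per_session = ratios(*counts)
    print(f"{name}: {per_speaker} sessions/speaker, {per_session} utterances/session")
```

The counts also show that both 5,994-speaker subsets contain exactly eight utterances per speaker (47,952 / 5,994), so they differ only in how those utterances are spread across sessions.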
6. Synthetic and Privacy-Preserving Successors
Owing to privacy and legal constraints, VoxCeleb2 is no longer officially distributed. Synthetic variants have been proposed in response, such as SynVox2, which is derived by anonymizing and vocoding authentic VoxCeleb2 recordings through the following pipeline (Miao et al., 2023):
- Content/prosody extraction (YAAPT, HuBERT content embeddings)
- Speaker representation (ECAPA-TDNN encoder), with either utterance-level or averaged, per-speaker "pseudo-speaker" vectors
- OHNN-based (orthogonal Householder) anonymization to scramble speaker embeddings, followed by HiFi-GAN vocoding
- Post-processing including MUSAN noise, room reverberation, or background replacement (DeepFilterNet-inspired)
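The role of Householder reflections in the anonymization step can be illustrated with plain linear algebra. The sketch below is not the OHNN architecture of Miao et al.: it only shows that a product of Householder reflections is an orthogonal matrix, so applying it to a speaker embedding changes its direction while preserving its norm. The reflection vectors are random here, whereas in a trained model they would be learned; the function names are hypothetical.

```python
import numpy as np

def householder(v):
    """Reflection H = I - 2 v v^T / (v^T v); H is symmetric and orthogonal."""
    v = np.asarray(v, dtype=np.float64)
    return np.eye(len(v)) - 2.0 * np.outer(v, v) / np.dot(v, v)

def orthogonal_transform(vectors):
    """Compose several reflections into a single orthogonal matrix."""
    q = np.eye(len(vectors[0]))
    for v in vectors:
        q = q @ householder(v)
    return q

rng = np.random.default_rng(0)
Q = orthogonal_transform(rng.normal(size=(4, 8)))   # 4 reflections in 8-D
embedding = rng.normal(size=8)                      # stand-in speaker vector
anonymized = Q @ embedding                          # rotated, same norm
```

Because the transform is orthogonal, distances between embeddings are preserved under a shared matrix, which is one reason such parameterizations are attractive for mapping speakers to pseudo-speakers without collapsing the embedding space.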
Empirical unlinkability is quantified via cross-dataset EER, which reaches 27–35% (best: SynVox2-OHNN-aug-utt, 34.76%). However, models trained on SynVox2 exhibit utility degradation: EER on VoxCeleb1 increases from 1.33% (authentic data) to 7.38% (best synthetic variant: utterance-level, augmented). Accent and gender fairness remain mostly preserved in terms of relative ranking, though the Fairness Discrepancy Rate (FDR) drops slightly for underrepresented groups (Miao et al., 2023).
7. Significance, Limitations, and Future Directions
The VoxCeleb2 dataset established new standards for large-scale, unconstrained speaker verification, enabling deep embedding systems to achieve state-of-the-art error rates and generalization in noisy, realistic conditions. Its design, which spans diverse demographics, spontaneous in-the-wild content, and rigorous automated audio-visual verification, directly catalyzed progress in robust speaker recognition.
Key insights:
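- Training on >1 M utterances lowers the VGG-M EER on the VoxCeleb1 test set from 7.80% to 5.94%; combined with a ResNet-50 trunk and test-time augmentation, the EER reaches 3.95%, roughly half that of the VoxCeleb1-trained baseline.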
- Ethnic, age, and acoustic diversity directly improves generalization.
- Deeper CNNs with two-stage training and hard negative mining converge more stably and deliver superior discriminability.
- Test-time temporal augmentation provides diminishing but measurable accuracy improvements.
Current challenges concern privacy, domain generalization, and inter-speaker variation, especially for synthetic successors. Research directions include integrating formal privacy guarantees (e.g., differential privacy), adversarial anonymization objectives, and hybrid synthetic-real datasets to maintain both privacy and utility amid evolving legal requirements (Miao et al., 2023).