SpoofCeleb Corpus: Multi-Modal Spoofing
- SpoofCeleb Corpus is a multi-modal collection comprising large-scale datasets for speech deepfake detection, social media hoax identification, and face anti-spoofing.
- The speech dataset utilizes automated preprocessing with advanced TTS and VC systems, ensuring speaker and attack disjoint protocols for robust benchmarking.
- The text and face datasets feature detailed annotations and diverse conditions, providing realistic evaluation environments for early hoax detection and face spoof resistance.
The SpoofCeleb corpus refers to three distinct large-scale, publicly available datasets targeting different modalities and tasks under the umbrella of spoofing detection: (1) speech deepfake detection and spoof-robust speaker verification in the wild (Jung et al., 18 Sep 2024), (2) early social media hoax detection with a focus on celebrity death reports (Zubiaga et al., 2018), and (3) large-scale face anti-spoofing with comprehensive annotation (Zhang et al., 2020). Each resource addresses critical gaps in its respective field by enabling robust, reproducible research on manipulated or falsified content detection under real-world conditions.
1. SpoofCeleb for Speech Deepfake and Spoof-Robust ASV
SpoofCeleb (Jung et al., 18 Sep 2024) is a large-scale, in-the-wild speech corpus for benchmarking Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV). It is constructed by applying an automated, TTS-grade preprocessing pipeline to the VoxCeleb1 corpus (1,251 speakers), followed by the synthesis of spoofed utterances using 23 state-of-the-art TTS and voice conversion (VC) systems.
- Source Data & Preprocessing: Starting with VoxCeleb1 audio, the TITW-Easy pipeline automates ASR transcription and alignment (WhisperX), silence-based segmentation, heuristic filtering (excluding non-English, too short/long or noisy segments), and neural speech enhancement (DEMUCS). Only bona fide segments (DNSMOS "BAK" score ≥ 3.0) suitable for TTS training are retained, producing ~248,000 high-quality utterances with aligned transcripts. The entire process is scriptable and repeatable for other "in-the-wild" corpora.
- Spoof Generation: Twenty-three spoofing systems (acoustic+vocoder, end-to-end, neural-codec) are trained on these processed utterances. Attack IDs (A01–A23) capture architectural diversity and configurations (e.g., pretraining, embedding injection).
- Dataset Structure: The corpus totals 2,687,292 utterances (1,251 speakers), partitioned into train (2,540,421; A00+A01–A10), validation (55,741; A00+A06,A07,A11–A14), and evaluation splits (91,130; A00+A15–A23), with strict speaker and attack disjointness.
| Partition | # Files | Bona fide | Spoof Systems |
|---|---|---|---|
| Train | 2,540,421 | 1,171 | A00 + A01–A10 |
| Validation | 55,741 | 40 | A00 + A06,A07,A11–A14 |
| Evaluation | 91,130 | 40 | A00 + A15–A23 |
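Given the partitioning above, a minimal sanity check on speaker and attack disjointness might look like the following sketch. The toy `partitions` dictionary is an illustrative stand-in for the official protocol files, and since validation intentionally reuses A06/A07 from training, the attack check applies only to train vs. evaluation:

```python
# Sketch: verify speaker- and attack-disjointness between SpoofCeleb-style
# partitions. The tiny toy protocol below is illustrative; real protocols
# list one utterance per line with speaker and attack IDs.

def disjoint(a, b):
    """True if the two ID collections share no elements."""
    return not (set(a) & set(b))

# Hypothetical per-partition speaker and attack IDs (A00 = bona fide).
partitions = {
    "train":      {"speakers": {"spk0001", "spk0002"}, "attacks": {"A01", "A10"}},
    "validation": {"speakers": {"spk0003"},            "attacks": {"A06", "A11"}},
    "evaluation": {"speakers": {"spk0004"},            "attacks": {"A15", "A23"}},
}

def check_protocol(parts):
    """Return True when every partition pair is speaker-disjoint and the
    train/evaluation attack sets do not overlap (validation may share
    attacks with train by design)."""
    names = list(parts)
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            if not disjoint(parts[p]["speakers"], parts[q]["speakers"]):
                return False
    return disjoint(parts["train"]["attacks"], parts["evaluation"]["attacks"])

print(check_protocol(partitions))  # expect True for a valid split
```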
- Acoustic Conditions: The data inherits heterogeneous noise, reverberation, and device diversity from VoxCeleb1 (YouTube interviews), yielding an average UTMOS of 3.32 and DNSMOS of 2.78 (both MOS-scale quality predictors).
- Protocols & Benchmarks: Train/val/test sets have fully disjoint speakers and spoof types. No cross-validation is performed. Baselines include RawNet2 and AASIST for SDD, and various SASV configurations. Example metrics: SDD Eval EER = 1.12% (RawNet2), Eval EER = 2.37% (AASIST); SASV Eval a-DCF = 0.2902, SV-EER = 12.78%, SPF-EER = 5.00% when trained on SpoofCeleb.
- Access & Licensing: Dataset, code, and official protocols are at https://jungjee.github.io/spoofceleb (VoxCeleb1 licensing applies).
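The EER figures quoted above can be reproduced from raw detection scores with a simple threshold sweep. This is a generic sketch (higher score means "more bona fide"), not the official SpoofCeleb scoring tool:

```python
# Sketch: compute Equal Error Rate (EER) from detection scores, where
# higher scores indicate bona fide speech. Generic implementation, not
# the official SpoofCeleb scoring code.

def eer(bonafide_scores, spoof_scores):
    """Sweep thresholds over all observed scores and return the average
    of FAR and FRR at the point where they are closest."""
    best_gap, best_eer = 1.0, 1.0
    for t in sorted(bonafide_scores + spoof_scores):
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)       # spoofs accepted
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)  # bona fide rejected
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer

print(eer([0.9, 0.8, 0.7, 0.6], [0.5, 0.4, 0.3, 0.2]))  # separable scores -> 0.0
```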
This corpus enables systematic evaluation of both SDD and SASV under realistic attack and recording conditions, filling a gap left by prior datasets constrained to clean, studio-quality or low-variability sources (Jung et al., 18 Sep 2024).
2. SpoofCeleb Corpus for Early Celebrity Death Hoax Detection
The SpoofCeleb corpus (Zubiaga et al., 2018)—in this context—designates a large-scale, semantically annotated collection of Twitter death reports ("RIP" events) for the study of early hoax detection and event veracity classification.
- Construction: Harvests all English "RIP" tweets (Jan 2012–Dec 2014; 94.2M messages) via the Twitter API. Candidate events are identified when "RIP" is followed by a Wikidata person name/alias and ≥50 such tweets appear on one day. A Wikidata snapshot (Jan 2015) is used for event-to-entity mapping.
- Annotation Scheme: Labels—{real, commemoration, fake}—are assigned by semi-automated heuristics exploiting Wikidata death records (±1 day tolerance) and verified by manual validation (Cohen’s κ = 0.982).
- Corpus Statistics: Comprises 4,007 labeled events (~13.3M tweets; 2,301 "real", 1,092 "commemoration", 614 "fake"), covering 3,066 distinct individuals. Class imbalance reflects real-world prevalence (15.3% fake).
- Temporal & Structural Features: Fake reports cluster over shorter durations (median 50 hours) but can spread virally.
- Data Organization: Two files, `reports.csv` (event metadata, labels, intervals) and `tweets.csv` (tweet IDs, timestamps), enable reproducible experiments. Hydration via the Twitter API is required.
- Early Detection Protocol: A real-time scenario is simulated by releasing tweets in batches at fixed intervals after an event's first tweet. Combined social features (engagement statistics) and textual features (multiw2v, class-specific Word2Vec embeddings) drive a logistic regression classifier. Sliding-window and ablation analyses assess the importance of temporal context and individual features.
- Performance Benchmarks: The combined social+multiw2v feature set performs well from as early as 10 minutes after the first tweet; performance rises quickly as tweets accumulate and plateaus above 0.74 after two hours.
- Access & Licensing: Public release via Figshare (DOI: 10.6084/m9.figshare.5688811), CC-BY 4.0, complete with pretrained word-embedding models.
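The candidate-identification heuristic described above can be sketched as follows. `KNOWN_PEOPLE` is a tiny illustrative stand-in for the Wikidata name/alias snapshot, and the regex is a simplification of the real matching rules:

```python
# Sketch of the candidate-event heuristic: a day's tweets form a candidate
# event when "RIP" is followed by a known person name and at least 50 such
# tweets appear that day. Names and matching here are illustrative; the
# real pipeline matches against a full Wikidata snapshot of person names
# and aliases.
import re
from collections import Counter

KNOWN_PEOPLE = {"alan rickman", "david bowie"}  # stand-in for Wikidata aliases
MIN_TWEETS_PER_DAY = 50

def match_rip(tweet):
    """Return the known person name following 'RIP', or None."""
    m = re.search(r"\brip\s+([a-z ]+)", tweet.lower())
    if m:
        name = m.group(1).strip()
        for person in KNOWN_PEOPLE:
            if name.startswith(person):
                return person
    return None

def candidate_events(tweets_by_day):
    """tweets_by_day maps a date string to a list of tweet texts."""
    events = []
    for day, tweets in tweets_by_day.items():
        counts = Counter(p for p in map(match_rip, tweets) if p)
        events += [(day, person) for person, n in counts.items()
                   if n >= MIN_TWEETS_PER_DAY]
    return events
```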
The dataset establishes a realistic, temporally resolved benchmark for research on early detection of rumor, hoax, and memorialization phenomena in social-media streams (Zubiaga et al., 2018).
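The early-detection protocol amounts to scoring an event on progressively larger time windows after its first tweet. The sketch below simulates the batched release of tweets, with feature extraction and the logistic regression classifier left as stubs:

```python
# Sketch: simulate the early-detection protocol by exposing an event's
# tweets in cumulative time windows measured from the first tweet.
# Feature extraction and classification are stubbed out; the real setup
# combines social/engagement statistics with multiw2v text embeddings
# and a logistic regression model.
from datetime import datetime, timedelta

def tweets_within(tweets, minutes):
    """Tweets posted within `minutes` of the event's first tweet.
    Each tweet is a (timestamp, text) pair."""
    t0 = min(ts for ts, _ in tweets)
    cutoff = t0 + timedelta(minutes=minutes)
    return [t for t in tweets if t[0] <= cutoff]

start = datetime(2014, 1, 1, 12, 0)
event = [(start + timedelta(minutes=m), f"tweet {m}") for m in (0, 5, 30, 90)]

for window in (10, 60, 120):           # minutes after the first tweet
    visible = tweets_within(event, window)
    # classify(features(visible)) would run here in the real protocol
    print(window, len(visible))
```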
3. CelebA-Spoof: Large-Scale Face Anti-Spoofing
Often also referred to as SpoofCeleb (Zhang et al., 2020), CelebA-Spoof is a comprehensive, identity-labeled face anti-spoofing dataset supporting robust evaluation across environmental and device diversity.
- Content and Splits: Contains 625,537 images of 10,177 subjects, split into training (326,514; 5,000 ids), validation (80,513; 1,000 ids), and test (218,510; 4,177 ids) sets. Splits are identity-disjoint, with parity between live and spoofed images per split.
| Split | Images | Identities | Live/Spoof Ratio |
|---|---|---|---|
| Train | 326,514 | 5,000 | ~50/50 |
| Validation | 80,513 | 1,000 | ~50/50 |
| Test | 218,510 | 4,177 | ~50/50 |
- Acquisition Diversity: Recorded under 8 scene conditions (home/office × 4 illuminations) and 20 sensors (smartphones, webcams, tablets, CCTV).
- Annotation Schema: Each image is labeled with (1) live/spoof class, (2) one of 10 spoof attack types (varied print/photo/video/3D), and (3) 40 binary facial attributes from the original CelebA dataset.
- File Organization: Data are provided as per-subject subfolders, with annotation JSON specifying all metadata. Naming convention encodes subject ID, class, scene, sensor, and frame.
- Evaluation Protocols:
- Protocol 1—random identity-disjoint split (benchmark)
- Protocol 2—cross-illumination (generalization to unseen lighting)
- Protocol 3—cross-environment (office→home transfer)
- Protocol 4—cross-device (withheld devices at test time)
- Metrics:
- APCER (Attack Presentation Classification Error Rate): fraction of spoof presentations accepted as live, $\mathrm{APCER} = N_{\text{spoof}\to\text{live}} / N_{\text{spoof}}$
- BPCER (Bona Fide Presentation Classification Error Rate): fraction of live presentations rejected as spoof, $\mathrm{BPCER} = N_{\text{live}\to\text{spoof}} / N_{\text{live}}$
- ACER (Average Classification Error Rate): $\mathrm{ACER} = (\mathrm{APCER} + \mathrm{BPCER}) / 2$
- EER (Equal Error Rate): the error rate at the threshold where FPR equals FNR.
- Auxiliary Information Embedding Network (AENet): Multi-task ResNet backbone with three heads (binary live/spoof classification, spoof type [10-way], attribute vector [40-way]), trained with a composite loss of the form $L = L_{\text{cls}} + \lambda_{s} L_{\text{type}} + \lambda_{a} L_{\text{attr}}$, where the auxiliary weights $\lambda_{s}$ and $\lambda_{a}$ are kept small so that the primary live/spoof classification term dominates.
- Benchmark Results: AENet achieves 0.27% ACER and 0.24% EER (Protocol 1). Protocols focused on unseen environments and devices remain challenging (AENet ACER: 7.80%, baseline 14.30%).
- Guidelines: Recommended optimization (Adam optimizer with a standard learning-rate schedule), augmentation (flip, color jitter, crop, blur), and considerations for robust evaluation.
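The metrics listed above are straightforward to compute from per-image decisions. The following generic sketch (not the official CelebA-Spoof evaluation kit) illustrates them on a toy batch:

```python
# Sketch: compute APCER, BPCER, and ACER from ground-truth labels and
# binary predictions (1 = live, 0 = spoof). Generic code, not the
# official CelebA-Spoof evaluation kit.

def apcer(labels, preds):
    """Fraction of spoof samples wrongly accepted as live."""
    spoof = [p for l, p in zip(labels, preds) if l == 0]
    return sum(p == 1 for p in spoof) / len(spoof)

def bpcer(labels, preds):
    """Fraction of live samples wrongly rejected as spoof."""
    live = [p for l, p in zip(labels, preds) if l == 1]
    return sum(p == 0 for p in live) / len(live)

def acer(labels, preds):
    """Average of the two one-sided error rates."""
    return (apcer(labels, preds) + bpcer(labels, preds)) / 2

labels = [1, 1, 1, 1, 0, 0, 0, 0]   # 4 live, 4 spoof
preds  = [1, 1, 1, 0, 0, 0, 0, 1]   # one error of each kind
print(apcer(labels, preds), bpcer(labels, preds), acer(labels, preds))
# -> 0.25 0.25 0.25
```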
CelebA-Spoof enables realistic anti-spoofing algorithm validation with detailed environmental and biometric annotation (Zhang et al., 2020).
4. Comparative Summary of SpoofCeleb Datasets
| Name | Modality | Task | Size | Unique Features | Citation |
|---|---|---|---|---|---|
| SpoofCeleb (SDD/SASV) | Speech | Deepfake+ASV robustness | 2.7M utterances/1,251 spkrs | 23 TTS/VC systems, in-the-wild, protocol-rich | (Jung et al., 18 Sep 2024) |
| SpoofCeleb (Death Hoaxes) | Text/social | Early rumor/hoax detection | 13M tweets, 4,007 events | Real/com/hoax, time-resolved, Wikidata mapping | (Zubiaga et al., 2018) |
| CelebA-Spoof | Face images | Face anti-spoofing | 625K images/10,177 ids | 10 attack types, 40 attributes, 8 scenes, 20 sensors | (Zhang et al., 2020) |
Each resource enables systematic, protocol-driven research with strong baseline results and facilitates transfer learning and benchmarking across domains.
5. Significance and Applications
SpoofCeleb datasets collectively serve as canonical resources for evaluating spoofing detection models under realistic, large-scale, and heterogeneous conditions. They enable:
- Advancement of deepfake and spoof-robust speaker verification, via speaker-disjoint, attack-disjoint, and environment-variable speech benchmarks (Jung et al., 18 Sep 2024).
- Rigorous study of temporal veracity classification, social rumor diffusion, and early detection algorithms in naturalistic, imbalanced social streams (Zubiaga et al., 2018).
- Comprehensive evaluation of face anti-spoofing systems—including resilience to new sensors, environments, and spoof modalities—through identity-disjoint, well-annotated face imagery (Zhang et al., 2020).
Researchers can leverage these datasets for benchmarking, transfer learning, protocol comparisons, and development of new discriminative or generative architectures addressing real-world spoofing challenges.