
ASVspoof 5 Dataset Overview

Updated 5 October 2025
  • ASVspoof 5 Dataset is a crowdsourced speech corpus built from MLS data, offering diverse acoustic conditions and 32 sophisticated attack algorithms.
  • The dataset features rigorously defined speaker-disjoint partitions and auxiliary data to support robust, real-world evaluation protocols.
  • It enables the development of countermeasure and ASV systems by addressing spoofing, deepfake, and adversarial audio attacks in realistic environments.

ASVspoof 5 Dataset is a large-scale, crowdsourced resource developed for the fifth edition of the ASVspoof challenge, designed to drive research in the detection of spoofing, deepfake, and adversarial audio attacks against automatic speaker verification (ASV) systems. In contrast to studio-quality corpora used in previous editions, ASVspoof 5 leverages the Multilingual LibriSpeech (MLS) English partition, collecting speech from thousands of speakers recorded with consumer devices under highly varied, real-world acoustic conditions. The corpus is accompanied by a suite of 32 attack algorithms, including legacy and contemporary text-to-speech (TTS), voice conversion (VC), and—for the first time—adversarial attacks engineered using surrogate detection models. The resource includes carefully defined speaker-disjoint partitions, auxiliary data for robust encoder development, and extensive evaluation protocols, making it a comprehensive testbed for anti-spoofing and spoofing-robust speaker verification systems (Wang et al., 13 Feb 2025, Wang et al., 16 Aug 2024).

1. Dataset Design and Acquisition

ASVspoof 5 marks a fundamental shift in database composition by sourcing its speech from crowdsourced environments rather than controlled studios. The entire corpus is based on the MLS English dataset, itself derived from LibriVox public domain audiobooks, in which volunteer speakers record audio in their own environments and with heterogeneous device setups (Wang et al., 13 Feb 2025). Over 4,000 individual speakers are represented, spanning diverse genders and ages, ensuring broad coverage of speaker and acoustic variability.

Speech samples are divided into three disjoint subsets—training, development, and evaluation—so that no speaker overlaps between these. Each subset consists of bona fide (genuine) utterances and spoofed samples. Spoofed utterances are produced by applying algorithms to the source speech, either synthesizing new audio with TTS/VC methods or modifying the waveforms via adversarial or codec effects. These subsets are further supported by auxiliary data from ~30,000 additional speakers used for training speaker encoders, which are instrumental for certain attack generation algorithms.

This partitioning strategy, combined with protocol definitions for attack model training and surrogate model development, allows for rigorous speaker-independent evaluation of ASV and countermeasure (CM) systems (Wang et al., 13 Feb 2025).
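The speaker-disjoint property described above can be enforced with a simple split over speaker IDs. The sketch below is illustrative only — the split ratios and the ID format are assumptions, not the official ASVspoof partitioning tool:

```python
import random

def speaker_disjoint_split(utterances, seed=0, ratios=(0.6, 0.2, 0.2)):
    """Split (utt_id, speaker_id) pairs into train/dev/eval partitions
    such that no speaker appears in more than one partition."""
    speakers = sorted({spk for _, spk in utterances})
    random.Random(seed).shuffle(speakers)
    n = len(speakers)
    n_train = int(ratios[0] * n)
    n_dev = int(ratios[1] * n)
    groups = {
        "train": set(speakers[:n_train]),
        "dev": set(speakers[n_train:n_train + n_dev]),
        "eval": set(speakers[n_train + n_dev:]),
    }
    split = {name: [] for name in groups}
    for utt, spk in utterances:
        for name, members in groups.items():
            if spk in members:
                split[name].append(utt)
    return split
```

Because the split is made over speakers rather than utterances, disjointness holds by construction, which is the property that rules out speaker-identity shortcut cues.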

2. Attack Algorithms and Categories

The ASVspoof 5 dataset incorporates a comprehensive spectrum of 32 attack algorithms, grouped into TTS, VC, and adversarial attacks (Wang et al., 13 Feb 2025, Wang et al., 16 Aug 2024). Attacks are crowdsourced and optimized via surrogate models to defeat detection systems. Notable categories include:

  • Zero-shot TTS: Models such as Glow-TTS, Grad-TTS, FastPitch, VITS, XTTS, and YourTTS create speech in arbitrary speaker voices without explicit target adaptation.
  • Legacy TTS: Unit-selection synthesis (e.g., MaryTTS) concatenates pre-recorded speech units, often yielding low perceptual quality but high detector error rates.
  • Voice Conversion: Methods such as StarGAN-VC, CycleGAN-VC, and zero-shot VC approaches transfer vocal characteristics between speakers.
  • Adversarial Attacks: Malafide and Malacopula filters are applied to spoofed utterances. Malafide leverages non-causal LTI filtering to manipulate CM scores, while Malacopula exploits speaker-specific nonlinear filters to optimize for ASV embedding similarity. Some attacks (e.g., A30) combine multiple adversarial methods.

Codec and compression variations are also introduced in the evaluation set using systems such as Opus, AMR, Speex, and the neural codec EnCodec, along with bandwidth reductions (notably down to 8 kHz), increasing the realism and variability of attack scenarios (Wang et al., 13 Feb 2025, Wang et al., 16 Aug 2024, Weizman et al., 21 May 2025).

3. Partitioning, Protocols, and Auxiliary Data

ASVspoof 5 is structured into seven partitions, summarized here by function, to support attack algorithm training, surrogate model validation, and robust evaluation (Wang et al., 13 Feb 2025):

  • Training Set: Used to develop detection and verification systems; includes bona fide utterances and spoofed audio generated from a primary group of attack methods.
  • Development Set: Split to evaluate target and non-target trials, with spoofed samples created from a secondary, non-overlapping set of attacks.
  • Evaluation Set: Used for final challenge assessments, combining target/non-target, a wider set of spoofing algorithms, and all adversarial modifications.
  • Surrogate Sets: Designed for optimization of attacks via surrogate CM and ASV systems—critical for creating adversarial examples.
  • Auxiliary Data: Speech from a large pool of speakers (not included in evaluation) to train speaker encoders required for zero-shot and VC attack algorithms.

Partitioning prevents overlap between speakers in any training or evaluation role, eliminating shortcut cues and supporting robust model assessment.

4. Evaluation Metrics and Baselines

Two main evaluation metrics are defined across challenge tracks (Wang et al., 13 Feb 2025, Wang et al., 16 Aug 2024):

  • Track 1 (Countermeasure Detection): The normalized detection cost function, minimized over the CM decision threshold (minDCF), quantifies the tradeoff between miss and false-alarm rates:

$$\mathrm{DCF}(\tau_{cm}) = \beta \cdot P_{miss}^{cm}(\tau_{cm}) + P_{fa}^{cm}(\tau_{cm})$$

where $\beta = \dfrac{C_{miss} \cdot (1 - \pi_{spf})}{C_{fa} \cdot \pi_{spf}}$, with $C_{miss} = 1$, $C_{fa} = 10$, and $\pi_{spf} = 0.05$.

  • Track 2 (Spoofing-Robust ASV, SASV): The architecture-agnostic DCF (a-DCF) is

$$a\text{-}\mathrm{DCF}(\tau_{sasv}) = \alpha \cdot P_{miss}^{sasv}(\tau_{sasv}) + (1-\gamma) \cdot P_{fa,non}^{sasv}(\tau_{sasv}) + \gamma \cdot P_{fa,spf}^{sasv}(\tau_{sasv})$$

where $\alpha$ and $\gamma$ incorporate the costs and priors for the target, non-target, and spoofed cases.

Supporting metrics include actDCF (actual cost at the Bayes threshold), the cost of log-likelihood ratios ($C_{llr}$) for calibration, the tandem DCF (t-DCF) for ASV+CM fusion, and the spoof EER.
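Both costs can be computed from system scores by sweeping the decision threshold. The sketch below is illustrative, not the official evaluation toolkit: the minDCF constants follow the values stated above, while the a-DCF weights `alpha` and `gamma` are placeholder values (the official weights derive from the challenge's costs and priors).

```python
def min_dcf(bonafide, spoof, c_miss=1.0, c_fa=10.0, pi_spf=0.05):
    """Minimum DCF for a CM: sweep the threshold over all observed scores.
    A miss is a rejected bona fide trial; a false alarm is an accepted spoof."""
    beta = c_miss * (1.0 - pi_spf) / (c_fa * pi_spf)
    thresholds = sorted(set(bonafide) | set(spoof)) + [float("inf")]
    best = float("inf")
    for t in thresholds:
        p_miss = sum(s < t for s in bonafide) / len(bonafide)
        p_fa = sum(s >= t for s in spoof) / len(spoof)
        best = min(best, beta * p_miss + p_fa)
    return best

def min_a_dcf(target, nontarget, spoof, alpha=0.9, gamma=0.1):
    """Minimum a-DCF for an SASV system: alpha weights target misses,
    (1 - gamma) and gamma weight non-target and spoof false alarms."""
    thresholds = sorted(set(target) | set(nontarget) | set(spoof)) + [float("inf")]
    best = float("inf")
    for t in thresholds:
        p_miss = sum(s < t for s in target) / len(target)
        p_fa_non = sum(s >= t for s in nontarget) / len(nontarget)
        p_fa_spf = sum(s >= t for s in spoof) / len(spoof)
        best = min(best, alpha * p_miss + (1 - gamma) * p_fa_non + gamma * p_fa_spf)
    return best
```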

Baseline systems such as ECAPA-TDNN (ASV), RawNet2 and AASIST (CM), and MFA-Conformer (end-to-end SASV) are provided and trained on the challenge data. These baselines are significantly challenged by the advanced attacks, with error rates that leave considerable room for improvement (Wang et al., 16 Aug 2024, Wang et al., 13 Feb 2025).
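The EERs reported for such baselines can be computed from score lists as follows; this is an illustrative sketch, not the challenge scoring code:

```python
def eer(positive_scores, negative_scores):
    """Equal error rate: the operating point where the miss rate equals
    the false-alarm rate. Returns the rate at the observed threshold
    where the two error curves come closest."""
    thresholds = sorted(set(positive_scores) | set(negative_scores))
    best_gap, best_rate = float("inf"), 1.0
    for t in thresholds:
        p_miss = sum(s < t for s in positive_scores) / len(positive_scores)
        p_fa = sum(s >= t for s in negative_scores) / len(negative_scores)
        gap = abs(p_miss - p_fa)
        if gap < best_gap:
            best_gap, best_rate = gap, (p_miss + p_fa) / 2
    return best_rate
```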

5. Comparative Database Analysis

ASVspoof 5 diverges markedly from previous releases (notably ASVspoof 2019) in database conditions (Weizman et al., 21 May 2025). Whereas earlier editions maintained matched bona fide speech statistics across training and evaluation, ASVspoof 5 deliberately introduces mismatches (speaker profiles, acoustic environments, codec distortions) in both genuine and spoofed samples. Quantitative comparison of PMF-based embeddings demonstrates:

  • Increased overlap between bona fide and spoof PMFs in ASVspoof 5, complicating separation for CM systems.
  • Sharply higher miss rates (up to 90%) when systems designed on ASVspoof 2019 are evaluated on ASVspoof 5.
  • Similarity metrics (symmetric KL divergence, Hellinger distance, Kolmogorov-Smirnov distance) confirming the statistical convergence of bona fide and spoofed distributions under ASVspoof 5 protocols.

These shifts raise the difficulty of both detection and verification tasks and highlight the necessity for careful database validation and robust, spectrum-aware modeling protocols (Weizman et al., 21 May 2025).
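The three similarity measures cited above are standard and straightforward to reproduce on discrete PMFs. A minimal sketch (not the analysis code of the cited study):

```python
import math

def sym_kl(p, q, eps=1e-12):
    """Symmetric Kullback-Leibler divergence between two discrete PMFs;
    eps guards against zero-probability bins."""
    kl = lambda a, b: sum(ai * math.log((ai + eps) / (bi + eps))
                          for ai, bi in zip(a, b))
    return kl(p, q) + kl(q, p)

def hellinger(p, q):
    """Hellinger distance between two discrete PMFs
    (0 = identical, 1 = disjoint supports)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

def ks_distance(p, q):
    """Kolmogorov-Smirnov distance: maximum gap between the two CDFs."""
    cdf_p = cdf_q = 0.0
    gap = 0.0
    for a, b in zip(p, q):
        cdf_p += a
        cdf_q += b
        gap = max(gap, abs(cdf_p - cdf_q))
    return gap
```

All three shrink toward zero as the bona fide and spoof distributions converge, which is exactly the shift reported for ASVspoof 5 relative to earlier editions.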

6. Practical Applications and Baseline Validation

By design, ASVspoof 5 is intended for developing both stand-alone CM detectors and integrated SASV systems capable of operating in realistic telephony and multimedia scenarios where speech signals may be subject to spoofing, deepfake, or adversarial manipulations (Wang et al., 13 Feb 2025, Kurnaz et al., 2 Oct 2025). Auxiliary data augments the development of encoder models used in zero-shot TTS attacks, while evaluation metrics such as DCF and a-DCF enable practical benchmarking of error type relevance.

Experimental validation in (Wang et al., 13 Feb 2025) demonstrates:

  • For baseline ECAPA-TDNN ASV and RawNet2/AASIST CM systems, spoofed and adversarial utterances produce notable increases in equal error rates (EERs) compared to previous editions.
  • Legacy attacks, especially low-quality unit-selection TTS, may trigger high errors despite low perceptual MOS scores, showing the need for multidimensional benchmarking.
  • Codec-related artefacts (including those from neural codecs and bandwidth limits) produce distortion effects that challenge robustness even when countermeasures are well established in matched environments.

This validates the necessity of advanced augmentation and calibration in model design.

7. Impact, Availability, and Future Research Directions

ASVspoof 5 represents a substantial advance in the standardization and realism of spoofing databases. All resources, except for attack protocol generators, are openly released to the community with detailed evaluation scripts and publicly accessible baseline implementations. The protocol-driven structure—encompassing attack optimization, partitioning, and auxiliary data—supports reproducible research and continued progress in anti-spoofing and robust ASV.

Future work, suggested by the database’s challenges and by evaluation results, includes:

  • Developing spectrum-aware modeling strategies to handle overlap between bona fide and spoofed distributions.
  • Employing advanced augmentation (e.g., laundering attacks) and adversarial training for robustness to codec and low-bandwidth artefacts (Ali et al., 1 Oct 2024).
  • Applying modular, nonlinear fusion and calibrated integration strategies to exploit complementary cues from speaker and spoof branches (Kurnaz et al., 2 Oct 2025).
  • Moving toward more interpretable architectures (weighted cosine, score fusion) that expose subsystem vulnerabilities and facilitate targeted improvements.

ASVspoof 5 sets the benchmark for evaluating next-generation audio anti-spoofing technology and addressing evolving real-world fraud vectors in automatic speaker verification and deepfake detection.
