
ASVspoof 5 Challenge Overview

Updated 14 January 2026
  • ASVspoof 5 is a challenge designed to advance detection of synthetic, manipulated, or adversarial speech using a diverse crowdsourced corpus.
  • It features two tracks—stand-alone countermeasures and Spoofing-Aware Speaker Verification—evaluated under both closed and open conditions with metrics like minDCF and EER.
  • The challenge underscores the importance of robust SSL models, multi-view fusion, effective calibration, and adversarial training to improve ASV system security.

ASVspoof 5 Challenge

ASVspoof 5 is the fifth and most ambitious edition of the ASVspoof series, designed to advance research on the detection of synthetic, manipulated, or adversarially perturbed speech in both stand-alone anti-spoofing and speaker-verification contexts. Building on earlier challenges, ASVspoof 5 introduces a substantially more diverse crowdsourced evaluation corpus, highly optimized attacks including adversarial filtering, and codec-induced distortions. The challenge features two main tracks: stand-alone countermeasure (CM; Track 1) and Spoofing-Aware Speaker Verification (SASV; Track 2), with both “closed” and “open” evaluation conditions. The results and system evaluations provide a benchmark for understanding vulnerabilities of current ASV and anti-spoofing technologies under real-world threats (Wang et al., 7 Jan 2026, Wang et al., 2024, Wang et al., 13 Feb 2025).

1. Data Corpus, Attack Protocols, and Evaluation Conditions

ASVspoof 5 departs from prior editions by using a crowdsourced corpus built from the Multilingual LibriSpeech (MLS) dataset. The database comprises roughly 1,900 speakers (gender-balanced, per official statistics), each recorded under uncontrolled acoustic and device conditions. It provides training, development, and evaluation splits that differ not only in spoofing methods but also in bona fide utterance statistics, an explicit challenge relative to ASVspoof2019, which matched genuine speech across subsets (Weizman et al., 21 May 2025, Wang et al., 13 Feb 2025).

The protocol features 32 spoofing attack systems grouped into train/dev/eval partitions with complete attack disjointness between splits. Attacks span legacy and state-of-the-art zero-shot TTS systems, advanced VC pipelines, and for the first time, adversarial attacks: Malafide (designed to fool CMs via learned convolutional filtering) and Malacopula (targeting ASV systems via nonlinear adversarial perturbations) (Wang et al., 7 Jan 2026, Wang et al., 13 Feb 2025).

All evaluation utterances (bona fide and spoofed) in the eval set are subject to 11 codec conditions (including standard DSP codecs and DNN-based neural codecs like Encodec), sometimes applied successively, further increasing acoustic variability and data mismatch (Wang et al., 2024).
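Codec laundering of this kind can be reproduced offline when preparing augmentation data. Below is a minimal sketch that round-trips a waveform through a lossy codec using ffmpeg; the choice of Opus, the bitrate, and the 16 kHz resampling target are illustrative assumptions, not the official challenge pipeline.

```python
import pathlib
import subprocess
import tempfile

def codec_roundtrip(in_wav: str, out_wav: str,
                    codec: str = "libopus", bitrate: str = "16k") -> None:
    """Encode and decode a wav through a lossy codec (requires ffmpeg on PATH)."""
    with tempfile.TemporaryDirectory() as tmp:
        coded = pathlib.Path(tmp) / "coded.ogg"
        # Encode: wav -> lossy codec (codec and bitrate are illustrative)
        subprocess.run(["ffmpeg", "-y", "-i", in_wav,
                        "-c:a", codec, "-b:a", bitrate, str(coded)], check=True)
        # Decode back to a 16 kHz wav, as consumed by most CM front ends
        subprocess.run(["ffmpeg", "-y", "-i", str(coded),
                        "-ar", "16000", out_wav], check=True)

# Example: codec_roundtrip("bonafide.wav", "bonafide_opus16k.wav")
```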

2. Tracks, System Types, Metrics, and Baselines

Track 1 focuses on stand-alone countermeasures, requiring binary bona fide/spoof classification at the utterance level. Track 2 addresses the integrated SASV task, where the system must determine, given an enrollment and test utterance, whether the test is bona fide from the target speaker, a zero-effort non-target, or a spoof (Wang et al., 2024, Wang et al., 7 Jan 2026).

Two evaluation conditions are specified:

  • Closed: Only corpus-internal data and provided models; external data only for ASV pretraining (e.g., VoxCeleb2).
  • Open: External data or foundation models (e.g., SSL speech backbones pretrained on LibriSpeech) permitted, but strictly no speaker overlap with eval speakers.

Metrics are tailored to each track:

  • Track 1: The primary metric is the minimum Detection Cost Function (minDCF); secondary metrics include the EER and the cost of log-likelihood ratios ($C_{\text{llr}}$). The DCF is given by (a minimal numeric sketch follows this list):

$\mathrm{DCF}(\tau) = C_{\text{miss}}\, P_{\text{miss}}(\tau)\, (1-\pi_{\text{spf}}) + C_{\text{fa}}\, P_{\text{fa}}(\tau)\, \pi_{\text{spf}}$

  • Track 2: Architecture-agnostic min a-DCF is primary; secondary measures are ASV-constrained min t-DCF and t-EER, reflecting joint ASV/CM tradeoffs (Wang et al., 2024).
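As a concrete illustration of the Track 1 metric, the sketch below computes minDCF from per-class score arrays by sweeping the threshold over all observed scores. The cost parameters and spoof prior are illustrative placeholders, and the score convention (higher means more bona fide, with $P_{\text{miss}}$ over bona fide trials and $P_{\text{fa}}$ over spoof trials) is an assumption consistent with the formula above.

```python
import numpy as np

def min_dcf(bona_scores, spoof_scores, pi_spf=0.05, c_miss=1.0, c_fa=10.0):
    """minDCF per the formula above; assumes higher score = more bona fide."""
    bona = np.asarray(bona_scores)
    spoof = np.asarray(spoof_scores)
    # Candidate thresholds: all observed scores, plus -inf for the accept-all point
    thresholds = np.concatenate(([-np.inf],
                                 np.sort(np.concatenate([bona, spoof]))))
    dcfs = []
    for tau in thresholds:
        p_miss = np.mean(bona < tau)    # bona fide rejected as spoof
        p_fa = np.mean(spoof >= tau)    # spoof accepted as bona fide
        dcfs.append(c_miss * p_miss * (1 - pi_spf) + c_fa * p_fa * pi_spf)
    return min(dcfs)

# Toy example:
# min_dcf(bona_scores=[2.1, 1.7, 0.9], spoof_scores=[-1.3, 0.2, -0.5])
```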

Baselines for Track 1 use RawNet2 and AASIST; for Track 2, fusion systems combine AASIST with ECAPA-TDNN ASV back-ends (Wang et al., 2024, Wang et al., 13 Feb 2025).

3. System Designs and Algorithms Across Tracks

State-of-the-art approaches in both tracks have converged on multi-component, ensemble pipelines, particularly in the open condition:

  • Self-Supervised Learning (SSL) Front Ends: WavLM, wav2vec 2.0, HuBERT, UniSpeech-SAT, and Data2vec dominate as fixed (frozen) upstreams for both CM and SASV models. Ablations consistently show that the 5th encoder layer of transformer-based SSL models yields the most discriminative features for spoof detection, particularly when combined with appropriate data augmentation (a feature-extraction sketch follows this list) (Xie et al., 2024, Rohdin et al., 2024, Zhu et al., 2024).
  • Temporal Multi-Scale and Multi-View Fusion: Systems such as that of (Xie et al., 2024) exploit both short-context (e.g., 4 s) and long-context (e.g., 16 s) segments for multi-scale modeling, with per-segment CM scores fused via weighted averaging (e.g., weight 0.25 per scale). Similarly, multi-view fusion (combining CMs trained on different SSL layers) is deployed for improved robustness.
  • Graph-Attention Backbones: AASIST-type architectures with spectro-temporal graph attention often serve as downstream classifiers after SSL feature extraction. Configurations typically resize the initial FC layer to match the SSL feature dimension (e.g., 1024→768), add optional pooling, and use a two-class output (Xie et al., 2024).
  • Adversarial and Flatness-Aware Optimizers: Recent solutions explicitly enforce flat minima (e.g., via Gradient Norm Aware Minimization, GAM), motivated by generalization needs under domain (codec/attack) shift as in (Xu et al., 2024). This approach regularizes parameter updates to improve robustness to unseen perturbations.
  • Data Augmentation: RIR convolution, MUSAN noise, frequency masking simulating high-frequency dropout (FreqMask), SpecAugment TimeMask, Mixup, amplitude mixing, and laundering attacks (simulating various real-world audio post-processes) are widely used. Key findings confirm that targeted FreqMask over high-frequency bands is especially effective against the frequency dropouts introduced by codecs (Xie et al., 2024, Ali et al., 2024).
  • Fusion and Calibration: Weighted linear score fusion, monotonic LLR calibration (logistic regression, Beta transforms, logit), and non-linear fusion (e.g., negative LogSumExp) are standard. Calibration is critical for achieving low actual DCF (actDCF) at Bayes threshold, a requirement highlighted in post-challenge evaluation (Wang et al., 7 Jan 2026).
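To make the SSL front-end recipe concrete, here is a minimal sketch of extracting intermediate-layer features from a frozen wav2vec 2.0 upstream with HuggingFace transformers; the checkpoint name and the small linear head standing in for an AASIST back end are illustrative assumptions.

```python
import torch
from transformers import Wav2Vec2Model

# Frozen SSL upstream (checkpoint name is an illustrative choice)
ssl = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
ssl.eval()
for p in ssl.parameters():
    p.requires_grad_(False)

# Tiny stand-in for a downstream CM head (an AASIST back end would go here)
head = torch.nn.Sequential(torch.nn.Linear(768, 128), torch.nn.ReLU(),
                           torch.nn.Linear(128, 2))

wav = torch.randn(1, 64000)  # 4 s of 16 kHz audio (random placeholder)
with torch.no_grad():
    out = ssl(wav, output_hidden_states=True)
# hidden_states[0] is the pre-transformer projection; index 5 = 5th encoder layer
feats = out.hidden_states[5]          # shape (batch, frames, 768)
logits = head(feats.mean(dim=1))      # temporal average pooling -> 2-class logits
```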

4. Empirical Results and Principal Findings

The top-performing systems in open condition consistently leverage SSL feature ensembles, strong augmentation, and score-level calibration/fusion (Xie et al., 2024, Zhu et al., 2024, Xu et al., 2024):

| System Class | Condition | minDCF | EER (%) | Comments |
| --- | --- | --- | --- | --- |
| SOTA Open Ensemble | Open | 0.0158 | 0.55 | Temporal + multi-view fusion |
| SZU-AFS (A9 + ensembles) | Open | 0.115 | 4.04 | DA + GAM + score fusion |
| RealityDefender SLIM | Open | 0.1499 | 5.5 | SSL contrastive pretraining |
| Baseline (AASIST) | Closed | 0.7106 | 29.12 | RawNet2 encoding |
| BUT W2v2-MHFA | Open | 0.0848* | 3.3* | SSL + attentional pooling |

*Development set.

In the SASV track, systems combining multi-branch speaker embeddings (ECAPA-TDNN, WavLM), an AASIST CM, embedding- and score-level fusion, and back-end DNNs jointly optimized with BCE and a-DCF losses reach ≈0.07 min a-DCF in the open condition, versus >0.5 for the baselines (Kurnaz et al., 2024). A minimal fusion sketch follows.
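The sketch below shows score-level SASV fusion in this spirit; the equal weights and the soft-minimum alternative are illustrative assumptions, not the cited systems' exact recipes.

```python
import numpy as np

def sasv_score(asv_score: float, cm_score: float,
               w_asv: float = 0.5, w_cm: float = 0.5) -> float:
    """Weighted linear fusion of an ASV score (e.g., cosine similarity between
    enrollment and test embeddings) and a CM bona fide score; weights are
    illustrative and would normally be tuned on development data."""
    return w_asv * asv_score + w_cm * cm_score

def softmin_fusion(asv_score: float, cm_score: float) -> float:
    """Non-linear alternative: negative LogSumExp over negated scores,
    a soft minimum of the two subsystem scores."""
    return -np.logaddexp(-asv_score, -cm_score)
```

The soft-minimum variant rejects a trial when either subsystem rejects it, which matches the SASV requirement that the test utterance be both target and bona fide.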

Empirical ablations reveal:

  • SSL features from earlier transformer layers outperform final layers at capturing subtle deepfake artifacts.
  • Multi-scale and multi-view CMs provide complementary evidence, critical for low detection cost at ultra-low miss rates.
  • Combining RIR + MUSAN + FreqMask yields significant improvements over any single augmentation (a FreqMask sketch follows this list).
  • Calibration is frequently neglected—top minDCF systems often exhibit very poor actDCF, indicating scores are not proper LLRs until transformed (e.g., via logit or logistic regression) (Wang et al., 7 Jan 2026).
  • Strong domain shift exists: in-the-wild/out-of-domain sets see EERs >10%, even for SOTA open systems (Zhu et al., 2024).
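The high-frequency FreqMask named above can be sketched as an STFT-domain mask that zeroes bins above a random cutoff; the cutoff range, FFT size, and hop length are illustrative assumptions.

```python
import torch

def high_freq_mask(wav: torch.Tensor, sr: int = 16000,
                   min_cut_hz: int = 3500, max_cut_hz: int = 7500,
                   n_fft: int = 512, hop: int = 160) -> torch.Tensor:
    """Zero all STFT bins above a random cutoff, mimicking codec bandwidth loss."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(wav, n_fft, hop_length=hop, window=window,
                      return_complex=True)                  # (freq_bins, frames)
    cut_hz = torch.randint(min_cut_hz, max_cut_hz, (1,)).item()
    cut_bin = int(cut_hz / (sr / 2) * (n_fft // 2))
    spec[cut_bin:, :] = 0                                   # drop high-frequency bins
    return torch.istft(spec, n_fft, hop_length=hop, window=window,
                       length=wav.shape[-1])

# Example: masked = high_freq_mask(torch.randn(64000))
```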

5. Attack, Codec, and Domain-Specific Robustness

Challenge analysis highlights persistent vulnerabilities:

  • Codec Effects: DNN-based Encodec and narrow-band codecs degrade detection most; FreqMask-style augmentation is critical for mitigation but imperfect (Xie et al., 2024, Wang et al., 7 Jan 2026).
  • Adversarial Attacks: Malafide and Malacopula substantially increase CM and ASV EERs by exploiting model-internal statistics that standard augmentation does not address. Dedicated augmentation or adversarial training is recommended (see the sketch after this list).
  • Attack-Specific Difficulty: Legacy unit-selection TTS (MaryTTS, A19) remains hard to detect despite being perceptually weak. Zero-shot neural TTS systems (YourTTS, XTTS) are detected more reliably by CMs but are highly effective as ASV attacks (Wang et al., 13 Feb 2025, Wang et al., 7 Jan 2026).
  • Generalization Collapse: PMF-based metrics and UMAP embedding analyses demonstrate drastically reduced separation between genuine and spoofed speech under matched or unseen evaluation conditions in ASVspoof 5 compared to ASVspoof2019 (Weizman et al., 21 May 2025).
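The adversarial training recommended above can be sketched with a generic FGSM-style inner step on the input waveform. This is an illustration only: Malafide and Malacopula themselves use learned convolutional filters and nonlinear perturbations, and the model, loss, and epsilon below are placeholders.

```python
import torch
import torch.nn.functional as F

def fgsm_adversarial_step(model: torch.nn.Module, wav: torch.Tensor,
                          label: torch.Tensor, optimizer: torch.optim.Optimizer,
                          eps: float = 1e-3) -> None:
    """One training step on an FGSM-perturbed waveform (placeholder recipe).
    Assumes model(wav) returns two-class logits (bona fide vs. spoof)."""
    wav = wav.clone().requires_grad_(True)
    loss = F.cross_entropy(model(wav), label)
    grad = torch.autograd.grad(loss, wav)[0]
    adv_wav = (wav + eps * grad.sign()).detach()   # worst-case input perturbation
    optimizer.zero_grad()
    adv_loss = F.cross_entropy(model(adv_wav), label)
    adv_loss.backward()
    optimizer.step()
```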

6. Post-Challenge Insights, Limitations, and Roadmap

Post-challenge studies focus on calibration, cross-corpus generalization, and data-centric training:

  • Score Calibration: Proper LLR calibration (via affine/logit or other monotonic transforms) turns minDCF-leading systems into actDCF-leading systems, achieving near-Bayes-optimal decisions over a wide range of priors (a minimal sketch follows this list) (Wang et al., 7 Jan 2026).
  • Cross-Corpus Generalization: EERs for top systems jump from <5% (ASVspoof 5) to 10–18% on ASVspoof2019, ASVspoof2021, and in-the-wild testbeds, underlining persistent domain mismatch (Wang et al., 7 Jan 2026, Zhu et al., 2024).
  • Open Problems: Persistent challenges include domain adaptation, explicit modeling of genuine-speech variability, and robust augmentation for codecs and adversaries. Adopting actDCF or $C_{\text{llr}}$ as primary metrics and developing source-tracing or explainable multi-class CMs are proposed for future editions.
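A minimal sketch of the affine LLR calibration described above, fitted with scikit-learn logistic regression on held-out development scores; subtracting the training-prior log-odds to convert posterior log-odds into an LLR is the standard adjustment, and all parameter choices are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_llr_calibration(dev_scores, dev_labels):
    """Learn an affine map s -> a*s + b whose output is a proper LLR.
    dev_labels: 1 = bona fide, 0 = spoof."""
    scores = np.asarray(dev_scores).reshape(-1, 1)
    labels = np.asarray(dev_labels)
    lr = LogisticRegression().fit(scores, labels)
    a, b = lr.coef_[0, 0], lr.intercept_[0]
    # Remove the training-set prior so the output is an LLR, not posterior log-odds
    prior_logodds = np.log(labels.mean() / (1 - labels.mean()))
    return lambda s: a * np.asarray(s) + b - prior_logodds

# calibrate = fit_llr_calibration(dev_scores, dev_labels)
# llr_eval = calibrate(eval_scores)  # now thresholdable at the Bayes point
```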

Emerging approaches include low-rank adaptation for SSL backbones, advanced selection/pruning of training data, and focus on multilingual and cross-device robustness. A shift to multi-task or open-set configuration (detection + source recognition) and generative–discriminative hybrid architectures is anticipated as the field progresses (Wang et al., 7 Jan 2026).

7. Released Resources and Community Contributions

ASVspoof 5 provides:

  • Publicly available datasets, protocols, and baseline systems, with auxiliary sets (e.g., CommonVoice) for speaker-encoder development.
  • Codec/compression pipelines, scoring scripts, and LLR-calibration toolkits through the official GitHub repository.
  • Restricted (by request for ethical reasons) access to attack-generation protocols and surrogate evaluation servers (Wang et al., 13 Feb 2025, Wang et al., 2024).

Collectively, ASVspoof 5 sets a new bar for realistic, diversified, and challenging evaluation of anti-spoofing and SASV systems, driving both methodological advancement and practical awareness of vulnerabilities in ASV under present-day and foreseeable attack regimes.
