GMM-ResNet2: Ensemble of Group ResNet Networks for Synthetic Speech Detection (2407.02170v1)
Abstract: Deep learning models are widely used for speaker recognition and spoofed speech detection. We propose GMM-ResNet2 for synthetic speech detection. Compared with the previous GMM-ResNet model, GMM-ResNet2 has four improvements. First, GMMs of different orders differ in their ability to form smooth approximations to the feature distribution, so multiple GMMs are used to extract multi-scale Log Gaussian Probability features. Second, a grouping technique is used to improve classification accuracy by exposing the group cardinality while reducing both the number of parameters and the training time. The final score is obtained by averaging the outputs of all group classifiers. Third, the residual block is improved by including one activation function and one batch normalization layer. Finally, an ensemble-aware loss function is proposed to integrate the independent loss functions of all ensemble members. On the ASVspoof 2019 LA task, GMM-ResNet2 achieves a minimum t-DCF of 0.0227 and an EER of 0.79%. On the ASVspoof 2021 LA task, GMM-ResNet2 achieves a minimum t-DCF of 0.2362 and an EER of 2.19%, which correspond to relative reductions of 31.4% and 76.3%, respectively, compared with the LFCC-LCNN baseline.
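The score averaging and the ensemble-aware loss described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the loss, so the combination used here (the sum of each member's cross-entropy plus the cross-entropy of the averaged output) is an assumption, and the function names `ensemble_average` and `ensemble_aware_loss` are hypothetical rather than taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy of one example given class logits and the true label index."""
    return -math.log(softmax(logits)[label])


def ensemble_average(member_logits):
    """Average the outputs of all group classifiers, class by class."""
    n = len(member_logits)
    return [sum(column) / n for column in zip(*member_logits)]


def ensemble_aware_loss(member_logits, label):
    """ASSUMED form: sum of per-member losses plus the loss of the averaged output."""
    member_losses = [cross_entropy(logits, label) for logits in member_logits]
    ensemble_loss = cross_entropy(ensemble_average(member_logits), label)
    return sum(member_losses) + ensemble_loss


# Two hypothetical group classifiers scoring one utterance (class 0 = bona fide, 1 = spoof).
outputs = [[2.0, 0.5], [1.0, 1.5]]
final_score = ensemble_average(outputs)
loss = ensemble_aware_loss(outputs, label=0)
```

Training every member with its own loss term while also penalizing the averaged output is one common way to keep ensemble members individually useful and jointly consistent; the weighting between the two parts is left at 1 here purely for simplicity.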