Heterogeneity over Homogeneity: Investigating Multilingual Speech Pre-Trained Models for Detecting Audio Deepfake (2404.00809v1)
Abstract: In this work, we investigate multilingual speech pre-trained models (PTMs) for audio deepfake detection (ADD). We hypothesize that multilingual PTMs trained on large-scale, diverse multilingual data gain knowledge of diverse pitches, accents, and tones during pre-training, which makes them more robust to such variations and, as a result, more effective at detecting audio deepfakes. To validate this hypothesis, we extract representations from state-of-the-art (SOTA) PTMs, including monolingual and multilingual PTMs as well as PTMs trained for speaker and emotion recognition, and evaluate them on the ASVSpoof 2019 (ASV), In-the-Wild (ITW), and DECRO benchmark databases. We show that representations from multilingual PTMs, paired with simple downstream networks, attain the best ADD performance among the PTM representations compared, which validates our hypothesis. We also explore fusing selected PTM representations for further improvements in ADD and propose a framework, MiO (Merge into One), for this purpose. With MiO, we achieve SOTA performance on ASV and ITW and performance comparable to current SOTA works on DECRO.
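The pipeline outlined in the abstract (extract PTM representations, optionally fuse them, then train a simple downstream classifier) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the frame-level features are synthetic random arrays standing in for real PTM outputs, fusion is plain concatenation rather than the MiO design (which the abstract does not specify), and the downstream model is a small logistic regression. The helper names `mean_pool`, `fuse`, `train_logreg`, and `predict` are hypothetical.

```python
import numpy as np

def mean_pool(frames: np.ndarray) -> np.ndarray:
    """Collapse a (time, dim) matrix of frame-level features into one utterance vector."""
    return frames.mean(axis=0)

def fuse(rep_a: np.ndarray, rep_b: np.ndarray) -> np.ndarray:
    """Fuse two utterance vectors by concatenation (an assumed stand-in, not MiO)."""
    return np.concatenate([rep_a, rep_b])

def train_logreg(X: np.ndarray, y: np.ndarray, lr: float = 0.1, epochs: int = 200):
    """Fit a logistic-regression classifier with batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probability of "fake"
        grad = p - y                             # gradient of the log loss w.r.t. the logit
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def predict(X: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    """Return 1 (fake) where the logit is positive, else 0 (bona fide)."""
    return (X @ w + b > 0).astype(int)

rng = np.random.default_rng(0)

def synthetic_utterance(label: int) -> np.ndarray:
    """Build a fused vector from two synthetic 'PTM' outputs (50 frames each)."""
    shift = 1.0 if label == 1 else -1.0
    frames_a = rng.normal(shift, 1.0, size=(50, 8))   # hypothetical PTM A, 8-dim features
    frames_b = rng.normal(-shift, 1.0, size=(50, 4))  # hypothetical PTM B, 4-dim features
    return fuse(mean_pool(frames_a), mean_pool(frames_b))

labels = np.array([0, 1] * 20)
X = np.stack([synthetic_utterance(int(label)) for label in labels])
w, b = train_logreg(X, labels.astype(float))
preds = predict(X, w, b)
accuracy = float((preds == labels).mean())
print(f"fused feature shape: {X.shape}, training accuracy: {accuracy:.2f}")
```

In a real setup, the synthetic arrays would be replaced by hidden states extracted from actual PTMs, and results would be reported on held-out benchmark data rather than on the training set.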
- Orchid Chetia Phukan
- Gautam Siddharth Kashyap
- Arun Balaji Buduru
- Rajesh Sharma