
Towards Attention-based Contrastive Learning for Audio Spoof Detection (2407.03514v1)

Published 3 Jul 2024 in cs.SD, cs.AI, and eess.AS

Abstract: Vision transformers (ViTs) have made substantial progress on classification tasks in computer vision. Recently, Gong et al. '21 introduced attention-based modeling for several audio tasks. However, the use of ViTs for the audio spoof detection task remains relatively unexplored. We bridge this gap and introduce ViTs for this task. A vanilla baseline built on fine-tuning the SSAST (Gong et al. '22) audio ViT model achieves sub-optimal equal error rates (EERs). To improve performance, we propose a novel attention-based contrastive learning framework (SSAST-CL) that uses cross-attention to aid representation learning. Experiments show that our framework successfully disentangles the bonafide and spoof classes and helps learn better classifiers for the task. With an appropriate data augmentation policy, a model trained with our framework achieves competitive performance on the ASVspoof 2021 challenge. We provide comparisons and ablation studies to justify our claims.
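As a concrete illustration of the framework the abstract describes, here is a minimal PyTorch sketch of attention-based contrastive learning for spoof detection: a shared audio ViT encodes a pair of utterances, a cross-attention block lets each branch attend to the other, and a supervised contrastive loss (Khosla et al., ref. 9) pulls same-class embeddings together. This is a sketch under stated assumptions, not the authors' SSAST-CL implementation; the encoder interface, pooling, projection head, and all hyperparameters below are hypothetical.

```python
# Illustrative sketch only: the actual SSAST-CL architecture is not specified
# in the abstract, so every design choice here is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionBlock(nn.Module):
    """Lets tokens from one utterance attend to tokens from its paired utterance."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_tokens, key_value_tokens):
        attended, _ = self.attn(query_tokens, key_value_tokens, key_value_tokens)
        return self.norm(query_tokens + attended)  # residual connection + norm

class SiameseCLModel(nn.Module):
    """Shared encoder -> cross-attention fusion -> projection for the contrastive loss."""
    def __init__(self, encoder: nn.Module, dim: int = 768, proj_dim: int = 128):
        super().__init__()
        # Assumed: encoder (e.g., a pretrained SSAST-like audio ViT) maps a
        # spectrogram batch to token embeddings of shape (B, T, dim).
        self.encoder = encoder
        self.cross_attn = CrossAttentionBlock(dim)
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, proj_dim))

    def embed(self, tokens_q, tokens_kv):
        fused = self.cross_attn(tokens_q, tokens_kv)
        # Mean-pool over tokens, then L2-normalize for the contrastive loss.
        return F.normalize(self.proj(fused.mean(dim=1)), dim=-1)

    def forward(self, spec_a, spec_b):
        tok_a, tok_b = self.encoder(spec_a), self.encoder(spec_b)
        return self.embed(tok_a, tok_b), self.embed(tok_b, tok_a)

def supervised_contrastive_loss(z, labels, temperature: float = 0.07):
    """Supervised contrastive loss: embeddings with the same label are positives."""
    sim = z @ z.t() / temperature
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, float("-inf"))  # exclude self-similarity
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    pos_log_prob = torch.where(pos, log_prob, torch.zeros_like(log_prob))
    # Average log-probability over each anchor's positives, then over the batch.
    return -(pos_log_prob.sum(1) / pos.sum(1).clamp(min=1)).mean()
```

A training step might draw bonafide/spoof pairs, concatenate the two projected embeddings and their labels, and apply the loss; a downstream classifier would then be learned on the resulting representations. The pairing scheme and two-stage training are likewise assumptions about how such a framework could be organized, not details taken from the paper.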

References (39)
  1. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in Proc. ICLR, 2021.
  2. A. Hatamizadeh, H. Yin, J. Kautz, and P. Molchanov, “Global context vision transformers,” arXiv preprint arXiv:2206.09959, 2022. [Online]. Available: https://arxiv.org/abs/2206.09959
  3. Y. Gong, C.-I. Lai, Y.-A. Chung, and J. Glass, “SSAST: Self-supervised audio spectrogram transformer,” in Proc. AAAI Conference on Artificial Intelligence, 2022, pp. 10699–10709.
  4. Y. Gong, Y.-A. Chung, and J. Glass, “AST: Audio Spectrogram Transformer,” in Proc. Interspeech, 2021, pp. 571–575.
  5. J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio Set: An ontology and human-labeled dataset for audio events,” in Proc. IEEE ICASSP, 2017, pp. 776–780.
  6. V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proc. IEEE ICASSP, 2015, pp. 5206–5210.
  7. J. Yamagishi, X. Wang, M. Todisco, M. Sahidullah, J. Patino, A. Nautsch, X. Liu, K. A. Lee, T. Kinnunen, N. Evans et al., “ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection,” in ASVspoof 2021 Workshop – Automatic Speaker Verification and Spoofing Countermeasures Challenge, 2021.
  8. A. Nautsch, X. Wang, N. W. D. Evans, T. H. Kinnunen, V. Vestman, M. Todisco, H. Delgado, M. Sahidullah, J. Yamagishi, and K.-A. Lee, “ASVspoof 2019: Spoofing countermeasures for the detection of synthesized, converted and replayed speech,” IEEE Transactions on Biometrics, Behavior, and Identity Science, vol. 3, pp. 252–265, 2021.
  9. P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, “Supervised contrastive learning,” in Proc. NeurIPS, 2020, pp. 1–13.
  10. G. Koch, R. Zemel, R. Salakhutdinov et al., “Siamese neural networks for one-shot image recognition,” in ICML deep learning workshop, vol. 2, no. 1, 2015.
  11. Y. Qian, M. Lin, X. Sun, Z. Tan, and R. Jin, “Entroformer: A transformer-based entropy model for learned image compression,” in Proc. ICLR, 2022, pp. 1–15.
  12. L. Yuan, Q. Hou, Z. Jiang, J. Feng, and S. Yan, “VOLO: Vision outlooker for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–13, 2022.
  13. C.-F. Chen, Q. Fan, and R. Panda, “CrossViT: Cross-attention multi-scale vision transformer for image classification,” in Proc. ICCV, 2021, pp. 347–356.
  14. T. Xu, W. Chen, P. Wang, F. Wang, H. Li, and R. Jin, “CDTrans: Cross-domain transformer for unsupervised domain adaptation,” in Proc. ICLR, 2022, pp. 1–14.
  15. N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu, “Neural speech synthesis with transformer network,” in Proc. AAAI Conference on Artificial Intelligence, 2019, pp. 6706–6713.
  16. R. Hu and A. Singh, “UniT: Multimodal multitask learning with a unified transformer,” in Proc. ICCV, 2021, pp. 1419–1429.
  17. A. Saeed, D. Grangier, and N. Zeghidour, “Contrastive learning of general-purpose audio representations,” in Proc. IEEE ICASSP, 2021, pp. 3875–3879.
  18. Y. Xie, Z. Zhang, and Y. Yang, “Siamese network with wav2vec feature for spoofing speech detection,” in Proc. Interspeech, 2021, pp. 4269–4273.
  19. M. Sahidullah, T. H. Kinnunen, and C. Hanilçi, “A comparison of features for synthetic speech detection,” in Proc. Interspeech, 2015, pp. 2087–2091.
  20. M. Todisco, H. Delgado, and N. W. D. Evans, “Constant Q cepstral coefficients: A spoofing countermeasure for automatic speaker verification,” Comput. Speech Lang., vol. 45, pp. 516–535, 2017.
  21. Y. Gao, T. Vuong, M. Elyasi, G. Bharaj, and R. Singh, “Generalized spoofing detection inspired from audio generation artifacts,” in Proc. Interspeech, 2021, pp. 4185–4188.
  22. N. Zeghidour, O. Teboul, F. de Chaumont Quitry, and M. Tagliasacchi, “LEAF: A learnable frontend for audio classification,” in Proc. ICLR, 2021, pp. 1–16.
  23. G. Lavrentyeva, S. Novoselov, A. Tseren, M. Volkova, A. Gorlanov, and A. Kozlov, “STC antispoofing systems for the ASVspoof2019 challenge,” in Proc. Interspeech, 2019, pp. 1033–1037.
  24. H. Tak, J. Patino, M. Todisco, A. Nautsch, N. W. D. Evans, and A. Larcher, “End-to-end anti-spoofing with RawNet2,” in Proc. IEEE ICASSP, 2021, pp. 6369–6373.
  25. G. Hua, A. B. J. Teoh, and H. Zhang, “Towards end-to-end synthetic speech detection,” IEEE Signal Processing Letters, vol. 28, pp. 1265–1269, 2021.
  26. X. Liu, X. Wang, M. Sahidullah, J. Patino, H. Delgado, T. H. Kinnunen, M. Todisco, J. Yamagishi, N. W. D. Evans, A. Nautsch, and K.-A. Lee, “ASVspoof 2021: Towards spoofed and deepfake speech detection in the wild,” arXiv preprint arXiv:2210.02437, 2022. [Online]. Available: https://arxiv.org/abs/2210.02437
  27. H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi, and N. Evans, “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” in Proc. Odyssey, 2022, pp. 112–119.
  28. Y. Eom, Y. Lee, J. S. Um, and H. Kim, “Anti-spoofing using transfer learning with variational information bottleneck,” in Proc. Interspeech, 2022, pp. 3568–3572.
  29. A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. NeurIPS, 2020, pp. 12449–12460.
  30. K. R. Prajwal, R. Mukhopadhyay, V. P. Namboodiri, and C. Jawahar, “A lip sync expert is all you need for speech to lip generation in the wild,” in Proc. ACM International Conference on Multimedia, 2020, pp. 484–492.
  31. T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in Proc. ICML, 2020, pp. 1–11.
  32. E. Kharitonov, M. Rivière, G. Synnaeve, L. Wolf, P.-E. Mazaré, M. Douze, and E. Dupoux, “Data augmenting contrastive learning of speech representations in the time domain,” in Proc. IEEE Spoken Language Technology Workshop, 2020, pp. 215–222.
  33. H. Tak, M. R. Kamble, J. Patino, M. Todisco, and N. W. D. Evans, “RawBoost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,” in Proc. IEEE ICASSP, 2022, pp. 6382–6386.
  34. A. Tomilov, A. F. Svishchev, M. Volkova, A. Chirkovskiy, A. S. Kondratev, and G. Lavrentyeva, “STC antispoofing systems for the ASVspoof2021 challenge,” in 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, 2021, pp. 61–67.
  35. M. Ravanelli, T. Parcollet, and Y. Bengio, “The PyTorch-Kaldi speech recognition toolkit,” in Proc. IEEE ICASSP, 2019, pp. 6465–6469.
  36. T. Chen, E. el Khoury, K. Phatak, and G. Sivaraman, “Pindrop Labs' submission to the ASVspoof 2021 challenge,” in 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, 2021, pp. 90–93.
  37. A. Cohen, I. Rimon, E. Aflalo, and H. H. Permuter, “A study on data augmentation in voice anti-spoofing,” Speech Commun., vol. 141, pp. 56–67, 2021.
  38. A. Satt, S. Rozenberg, R. Hoory et al., “Efficient emotion recognition from speech using deep learning on spectrograms.” in Proc. Interspeech, 2017, pp. 1089–1093.
  39. W. Cai, Z. Cai, X. Zhang, X. Wang, and M. Li, “A novel learnable dictionary encoding layer for end-to-end language identification,” in Proc. IEEE ICASSP, 2018, pp. 5189–5193.
