FairSSD: Understanding Bias in Synthetic Speech Detectors (2404.10989v1)
Abstract: Methods that generate synthetic speech perceptually indistinguishable from speech recorded by a human speaker are readily available. Several incidents report the misuse of synthetic speech generated by these methods to commit fraud. To counter such misuse, many methods have been proposed to detect synthetic speech. Some of these detectors are more interpretable, generalize to detect synthetic speech in the wild, and are robust to noise. However, limited work has been done on understanding bias in these detectors. In this work, we examine bias in existing synthetic speech detectors to determine whether they unfairly target a particular gender, age, or accent group. We also inspect whether these detectors have a higher misclassification rate for bona fide speech from speech-impaired speakers compared to fluent speakers. Extensive experiments on 6 existing synthetic speech detectors using more than 0.9 million speech signals demonstrate that most detectors are gender, age, and accent biased, and that future work is needed to ensure fairness. To support future research, we release our evaluation dataset, the models used in our study, and source code at https://gitlab.com/viper-purdue/fairssd.
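The bias the abstract describes is typically measured by disaggregating a detector's error rates over demographic groups: if bona fide speech from one group is flagged as synthetic more often than from another, the detector unfairly targets that group. The sketch below is a minimal, hypothetical illustration of that idea (the function names and the toy data are not from the paper): it computes the per-group false positive rate on bona fide speech and the largest gap between groups, where a gap of zero would indicate parity.

```python
from collections import defaultdict

def per_group_fpr(labels, preds, groups):
    """False positive rate (bona fide speech flagged as synthetic),
    computed separately per demographic group.

    labels: ground truth, 0 = bona fide, 1 = synthetic
    preds:  detector output, 0 = bona fide, 1 = synthetic
    groups: demographic attribute per sample (e.g. gender, age, accent)
    """
    fp = defaultdict(int)  # bona fide samples flagged as synthetic, per group
    n = defaultdict(int)   # bona fide samples seen, per group
    for y, yhat, g in zip(labels, preds, groups):
        if y == 0:  # only bona fide speech contributes to the FPR
            n[g] += 1
            fp[g] += int(yhat == 1)
    return {g: fp[g] / n[g] for g in n}

def fpr_gap(rates):
    """Largest FPR difference between any two groups; 0 means parity."""
    vals = list(rates.values())
    return max(vals) - min(vals)

# Toy example: bona fide and synthetic speech from two accent groups
labels = [0, 0, 0, 0, 0, 0, 1, 1]
preds  = [1, 0, 0, 0, 1, 1, 1, 1]
groups = ["A", "A", "A", "B", "B", "B", "A", "B"]

rates = per_group_fpr(labels, preds, groups)
gap = fpr_gap(rates)  # nonzero gap suggests the detector is accent-biased
```

In this toy data, group B's bona fide speech is misclassified twice as often as group A's, which is exactly the kind of disparity a fairness audit of a detector would surface.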