
PITCH: AI-assisted Tagging of Deepfake Audio Calls using Challenge-Response (2402.18085v3)

Published 28 Feb 2024 in cs.SD, cs.CR, and eess.AS

Abstract: The rise of AI voice-cloning technology, particularly audio Real-time Deepfakes (RTDFs), has intensified social engineering attacks by enabling real-time voice impersonation that bypasses conventional enrollment-based authentication. To address this, we propose PITCH, a robust challenge-response method to detect and tag interactive deepfake audio calls. We developed a comprehensive taxonomy of audio challenges based on the human auditory system, linguistics, and environmental factors, yielding 20 prospective challenges. These were tested against leading voice-cloning systems using a novel dataset comprising 18,600 original and 1.6 million deepfake samples from 100 users. PITCH's prospective challenges enhanced machine detection capabilities to an 88.7% AUROC score on the full unbalanced dataset, enabling us to shortlist 10 functional challenges that balance security and usability. For human evaluation and subsequent analyses, we filtered a challenging, balanced subset. On this subset, human evaluators independently scored 72.6% accuracy, while machines achieved 87.7%. Acknowledging that call environments require higher human control, we used machines to aid call receivers in making decisions. Our solution uses an early warning system to tag suspicious incoming calls as "Deepfake-likely." Contrary to prior findings, we discovered that integrating human intuition with machine precision offers complementary advantages. Our solution gave users maximum control and boosted detection accuracy to 84.5%. As evidenced by this jump in accuracy, PITCH demonstrated the potential for AI-assisted pre-screening in call verification processes, offering an adaptable and usable approach to combat real-time voice-cloning attacks. Code to reproduce the results and access the data is available at https://github.com/mittalgovind/PITCH-Deepfakes.
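The abstract reports AUROC on the full unbalanced dataset and accuracy on a balanced subset. The following is a minimal sketch of why those two metrics suit those two settings. It is not the authors' evaluation code: the function names, the 0.5 threshold, and the toy labels and scores are all hypothetical and chosen only for illustration.

```python
def auroc(labels, scores):
    """AUROC via the Mann-Whitney statistic: the fraction of
    (positive, negative) pairs in which the positive sample receives
    the higher score, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples classified correctly at a fixed threshold
    (the threshold value here is an assumption, not from the paper)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Hypothetical toy data: 1 = deepfake, 0 = genuine, deliberately unbalanced.
toy_labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.2, 0.1, 0.3, 0.6, 0.7, 0.35]

toy_auroc = auroc(toy_labels, toy_scores)
toy_accuracy = accuracy(toy_labels, toy_scores)
# A detector that never flags anything scores every call as 0.0.
trivial_accuracy = accuracy(toy_labels, [0.0] * len(toy_labels))
```

On this toy data, a detector that labels every call as genuine reaches the same accuracy as the scored detector, because the majority class dominates. AUROC depends only on how the two classes are ranked against each other, so it is less sensitive to class imbalance, while accuracy is easier to interpret once the classes are balanced.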

Authors (5)
  1. Govind Mittal (8 papers)
  2. Arthur Jakobsson (3 papers)
  3. Kelly O. Marshall (6 papers)
  4. Chinmay Hegde (109 papers)
  5. Nasir Memon (35 papers)
