
Privacy against Real-Time Speech Emotion Detection via Acoustic Adversarial Evasion of Machine Learning (2211.09273v4)

Published 17 Nov 2022 in cs.LG, cs.CR, cs.SD, and eess.AS

Abstract: Smart speaker voice assistants (VAs) such as Amazon Echo and Google Home have been widely adopted due to their seamless integration with smart home devices and the Internet of Things (IoT) technologies. These VA services raise privacy concerns, especially due to their access to our speech. This work considers one such use case: the unaccountable and unauthorized surveillance of a user's emotion via speech emotion recognition (SER). This paper presents DARE-GP, a solution that creates additive noise to mask users' emotional information while preserving the transcription-relevant portions of their speech. DARE-GP does this by using a constrained genetic programming approach to learn the spectral frequency traits that depict target users' emotional content, and then generating a universal adversarial audio perturbation that provides this privacy protection. Unlike existing works, DARE-GP provides: a) real-time protection of previously unheard utterances, b) against previously unseen black-box SER classifiers, c) while protecting speech transcription, and d) does so in a realistic, acoustic environment. Further, this evasion is robust against defenses employed by a knowledgeable adversary. The evaluations in this work culminate with acoustic evaluations against two off-the-shelf commercial smart speakers, using a small-form-factor device (a Raspberry Pi) integrated with a wake-word system to evaluate the efficacy of its real-world, real-time deployment.
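The abstract's core idea, evolving a single additive perturbation that degrades an SER classifier while keeping the signal quiet enough to preserve transcription, can be sketched in miniature. Nothing below is the paper's actual implementation: the stand-in `ser_weights` (representing learned emotion-salient frequency bins), the loudness penalty (standing in for transcription preservation), and all bin counts and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

SR = 16_000   # sample rate in Hz (toy choice)
N_BINS = 64   # number of spectral bins a candidate perturbation controls

def synthesize(amplitudes, n_samples=SR):
    """Render per-bin amplitudes as additive time-domain noise (sum of sinusoids)."""
    freqs = np.linspace(100, 4000, N_BINS)  # audible-band bins, hypothetical spacing
    t = np.arange(n_samples) / SR
    return sum(a * np.sin(2 * np.pi * f * t) for a, f in zip(amplitudes, freqs))

def fitness(amplitudes, ser_weights):
    """Toy fitness: reward energy in bins the stand-in SER model relies on,
    minus a loudness penalty that stands in for transcription preservation."""
    evasion = float(np.dot(np.abs(amplitudes), ser_weights))
    loudness_penalty = float(np.sum(amplitudes ** 2))
    return evasion - 0.5 * loudness_penalty

def evolve(ser_weights, pop_size=30, generations=40, sigma=0.05):
    """(mu + lambda)-style evolutionary loop: mutate, score, keep the best half."""
    pop = rng.normal(0, 0.1, size=(pop_size, N_BINS))
    for _ in range(generations):
        children = pop + rng.normal(0, sigma, size=pop.shape)   # mutation
        combined = np.vstack([pop, children])
        scores = np.array([fitness(c, ser_weights) for c in combined])
        pop = combined[np.argsort(scores)[-pop_size:]]          # truncation selection
    return pop[-1]  # highest-scoring candidate

# Stand-in for the emotion-salient spectral traits DARE-GP learns from audio.
ser_weights = rng.uniform(0, 1, N_BINS)
best = evolve(ser_weights)
noise = synthesize(best)  # the "universal" perturbation, added to any utterance
```

The universality claim in the abstract corresponds to `noise` being fixed once and simply mixed into every future utterance, rather than recomputed per input; the paper's actual search uses genetic programming over spectral traits rather than this simplified amplitude vector.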

Authors (5)
  1. Brian Testa
  2. Yi Xiao
  3. Harshit Sharma
  4. Avery Gump
  5. Asif Salekin
Citations (6)