Not My Voice! A Taxonomy of Ethical and Safety Harms of Speech Generators (2402.01708v2)

Published 25 Jan 2024 in cs.CL, cs.AI, cs.CY, and eess.AS

Abstract: The rapid and wide-scale adoption of AI to generate human speech poses a range of significant ethical and safety risks to society that need to be addressed. For example, a growing number of speech generation incidents are associated with swatting attacks in the United States, where anonymous perpetrators create synthetic voices that call police officers to close down schools and hospitals, or to violently gain access to innocent citizens' homes. Incidents like this demonstrate that multimodal generative AI risks and harms do not exist in isolation, but arise from the interactions of multiple stakeholders and technical AI systems. In this paper we analyse speech generation incidents to study how patterns of specific harms arise. We find that specific harms can be categorised according to the exposure of affected individuals, that is to say whether they are a subject of, interact with, suffer due to, or are excluded from speech generation systems. Similarly, specific harms are also a consequence of the motives of the creators and deployers of the systems. Based on these insights we propose a conceptual framework for modelling pathways to ethical and safety harms of AI, which we use to develop a taxonomy of harms of speech generators. Our relational approach captures the complexity of risks and harms in sociotechnical AI systems, and yields a taxonomy that can support appropriate policy interventions and decision making for the responsible development and release of speech generation models.
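The abstract describes a two-dimensional way of organising harms: by the exposure of the affected individual (subject of, interacts with, suffers due to, or excluded from a speech generation system) and by the motive of the system's creators and deployers. As a minimal sketch only, assuming illustrative label names that are not the paper's exact terminology, the taxonomy dimensions could be represented as a small data structure like this:

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical sketch of the two dimensions described in the abstract:
# how an affected individual is exposed to a speech generation system,
# and the motive behind its creation or deployment. Labels are
# illustrative assumptions, not the paper's exact categories.

class Exposure(Enum):
    SUBJECT_OF = "is a subject of the system (e.g. their voice is cloned)"
    INTERACTS_WITH = "interacts with the system"
    SUFFERS_DUE_TO = "suffers due to the system's outputs"
    EXCLUDED_FROM = "is excluded from the system"

class Motive(Enum):
    MALICIOUS = "deliberate misuse (e.g. swatting, fraud)"
    NEGLIGENT = "harm from careless development or deployment"
    UNINTENDED = "harm despite benign intent"

@dataclass
class SpeechGenIncident:
    """A recorded incident, tagged along both taxonomy dimensions."""
    description: str
    exposure: Exposure
    motive: Motive

# Example: an AI-voice swatting call, as mentioned in the abstract.
incident = SpeechGenIncident(
    description="Synthetic voice used to place a hoax emergency call",
    exposure=Exposure.SUFFERS_DUE_TO,
    motive=Motive.MALICIOUS,
)
print(incident.exposure.name, incident.motive.name)
```

A relational analysis of an incident database could then group incidents by both fields to surface the patterns of harm the paper's framework is designed to capture.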

Authors (3)
  1. Wiebke Hutiri (6 papers)
  2. Orestis Papakyriakopoulos (1 paper)
  3. Alice Xiang (28 papers)
Citations (5)