No More Mumbles: Enhancing Robot Intelligibility through Speech Adaptation (2405.09708v1)

Published 15 May 2024 in cs.RO, cs.AI, and stat.CO

Abstract: Spoken language interaction is at the heart of interpersonal communication, and people flexibly adapt their speech to different individuals and environments. It is surprising that robots, and by extension other digital devices, are not equipped to adapt their speech and instead rely on fixed speech parameters, which often hinder comprehension by the user. We conducted a speech comprehension study involving 39 participants who were exposed to different environmental and contextual conditions. During the experiment, the robot articulated words using different vocal parameters, and the participants were tasked with both recognising the spoken words and rating their subjective impression of the robot's speech. The experiment's primary outcome shows that spaces with good acoustic quality positively correlate with intelligibility and user experience. However, increasing the distance between the user and the robot worsened the user experience, while distracting background sounds significantly reduced speech recognition accuracy and user satisfaction. We next built an adaptive voice for the robot. For this, the robot needs to know how difficult it is for a user to understand spoken language in a particular setting. We present a prediction model that rates how annoying the ambient acoustic environment is and, consequently, how hard it is to understand someone in this setting. Then, we develop a convolutional neural network model to adapt the robot's speech parameters to different users and spaces, while taking into account the influence of ambient acoustics on intelligibility. Finally, we present an evaluation with 27 users, demonstrating superior intelligibility and user experience with adaptive voice parameters compared to a fixed voice.
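The abstract describes two learned components: a model that rates how difficult the ambient acoustic environment makes listening, and a convolutional neural network that maps that context to adapted speech parameters. As a rough illustration of how such a pipeline could be wired up, the following is a minimal PyTorch sketch, assuming a log-mel spectrogram of the ambient sound as input and three output parameters (volume, pitch, rate); the layer sizes, the class name `AdaptiveVoiceNet`, and the two-headed design are illustrative assumptions, not the architecture published in the paper.

```python
# Hypothetical sketch (not the authors' released code): a small convolutional
# network that maps a log-mel spectrogram of ambient sound to (i) a scalar
# rating of how annoying/difficult the acoustic environment is and (ii) a set
# of speech parameters (e.g. volume, pitch, rate) for the robot's TTS engine.
# All layer sizes and parameter names are illustrative assumptions.
import torch
import torch.nn as nn


class AdaptiveVoiceNet(nn.Module):
    def __init__(self, n_mels: int = 64, n_speech_params: int = 3):
        super().__init__()
        # Convolutional encoder over the (1, n_mels, time) spectrogram.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> (batch, 32, 1, 1)
        )
        # Head 1: predicted annoyance of the ambient acoustic environment.
        self.annoyance_head = nn.Linear(32, 1)
        # Head 2: speech parameters squashed to [0, 1]; they would later be
        # rescaled to the ranges accepted by the TTS engine.
        self.speech_head = nn.Sequential(
            nn.Linear(32, n_speech_params), nn.Sigmoid()
        )

    def forward(self, log_mel: torch.Tensor):
        # log_mel: (batch, 1, n_mels, time)
        feats = self.encoder(log_mel).flatten(1)
        return self.annoyance_head(feats), self.speech_head(feats)


if __name__ == "__main__":
    model = AdaptiveVoiceNet()
    ambient = torch.randn(2, 1, 64, 128)  # two 128-frame ambient-noise clips
    annoyance, speech_params = model(ambient)
    print(annoyance.shape, speech_params.shape)  # (2, 1) and (2, 3)
```

In this hypothetical setup, the sigmoid-squashed outputs would be mapped onto the volume, pitch, and rate ranges of the robot's text-to-speech engine before synthesis, so a noisier or more distant setting yields louder, slower speech.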
