
Acoustic characterization of speech rhythm: going beyond metrics with recurrent neural networks (2401.14416v1)

Published 22 Jan 2024 in eess.AS, cs.LG, and cs.SD

Abstract: Languages have long been described according to their perceived rhythmic attributes. The associated typologies are of interest in psycholinguistics as they partly predict newborns' abilities to discriminate between languages and provide insights into how adult listeners process non-native languages. Despite the relative success of rhythm metrics in supporting the existence of linguistic rhythmic classes, quantitative studies have yet to capture the full complexity of temporal regularities associated with speech rhythm. We argue that deep learning offers a powerful pattern-recognition approach to advance the characterization of the acoustic bases of speech rhythm. To explore this hypothesis, we trained a medium-sized recurrent neural network on a language identification task over a large database of speech recordings in 21 languages. The network had access to the amplitude envelopes and a variable identifying the voiced segments, assuming that this signal would poorly convey phonetic information but preserve prosodic features. The network was able to identify the language of 10-second recordings in 40% of the cases, and the language was in the top-3 guesses in two-thirds of the cases. Visualization methods show that representations built from the network activations are consistent with speech rhythm typologies, although the resulting maps are more complex than two separated clusters between stress and syllable-timed languages. We further analyzed the model by identifying correlations between network activations and known speech rhythm metrics. The findings illustrate the potential of deep learning tools to advance our understanding of speech rhythm through the identification and exploration of linguistically relevant acoustic feature spaces.
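The two identification figures quoted in the abstract (correct language in 40% of cases, correct language among the top-3 guesses in about two-thirds of cases) are top-k accuracies. The sketch below shows how such figures are typically computed from per-language scores. It is only an illustration: the function name `top_k_accuracy`, the four-class setup, and the toy scores are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered by descending score (ties keep the lower index first).
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy data (hypothetical): 4 recordings, 4 candidate languages.
toy_scores = [
    [0.7, 0.1, 0.1, 0.1],  # true class 0: ranked first
    [0.2, 0.5, 0.3, 0.0],  # true class 2: ranked second
    [0.1, 0.2, 0.3, 0.4],  # true class 0: ranked last
    [0.4, 0.3, 0.2, 0.1],  # true class 1: ranked second
]
toy_labels = [0, 2, 0, 1]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top3 = top_k_accuracy(toy_scores, toy_labels, k=3)
```

With 21 languages, a uniform random guesser would score about 1/21 (roughly 5%) top-1 and 3/21 (roughly 14%) top-3, which gives a baseline for reading the reported figures.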

