Encoding of lexical tone in self-supervised models of spoken language (2403.16865v2)
Abstract: Interpretability research has shown that self-supervised Spoken Language Models (SLMs) encode a wide variety of features in human speech from the acoustic, phonetic, phonological, syntactic and semantic levels, to speaker characteristics. The bulk of prior research on representations of phonology has focused on segmental features such as phonemes; the encoding of suprasegmental phonology (such as tone and stress patterns) in SLMs is not yet well understood. Tone is a suprasegmental feature that is present in more than half of the world's languages. This paper aims to analyze the tone encoding capabilities of SLMs, using Mandarin and Vietnamese as case studies. We show that SLMs encode lexical tone to a significant degree even when they are trained on data from non-tonal languages. We further find that SLMs behave similarly to native and non-native human participants in tone and consonant perception studies, but they do not follow the same developmental trajectory.