A predictive learning model can simulate temporal dynamics and context effects found in neural representations of continuous speech (2405.08237v1)
Abstract: Speech perception involves storing and integrating sequentially presented items. Recent work in cognitive neuroscience has identified temporal and contextual characteristics in humans' neural encoding of speech that may facilitate this temporal processing. In this study, we simulated similar analyses with representations extracted from a computational model that was trained on unlabelled speech with the learning objective of predicting upcoming acoustics. Our simulations revealed temporal dynamics similar to those in brain signals, implying that these properties can arise without linguistic knowledge. Another property shared between brains and the model is that the encoding patterns of phonemes support some degree of cross-context generalization. However, we found evidence that the effectiveness of these generalizations depends on the specific contexts, which suggests that this analysis alone is insufficient to support the presence of context-invariant encoding.
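The learning objective described in the abstract, predicting upcoming acoustics from past input, can be illustrated with a minimal sketch. Everything below is an illustrative assumption rather than the paper's actual setup: the model here is a single linear predictor trained on a toy autoregressive signal, whereas the study uses a deep self-supervised network trained on raw speech and analyzes its internal representations.

```python
import numpy as np

# Minimal sketch of a predictive learning objective: a linear model is
# trained to predict the acoustic frame one step ahead from the current
# frame. The toy AR(1) "signal", the linear predictor, and all
# hyperparameters are illustrative assumptions, not the paper's method.

rng = np.random.default_rng(0)

def make_signal(n_frames=500, dim=8):
    """Toy 'acoustics': an AR(1) process, so adjacent frames are correlated."""
    x = np.zeros((n_frames, dim))
    x[0] = rng.standard_normal(dim)
    for t in range(1, n_frames):
        x[t] = 0.9 * x[t - 1] + rng.standard_normal(dim)
    return x

def train_predictor(x, lr=0.01, epochs=300):
    """Gradient descent on the MSE between x[t] @ W and the next frame x[t+1]."""
    inp, tgt = x[:-1], x[1:]
    W = np.zeros((x.shape[1], x.shape[1]))
    for _ in range(epochs):
        err = inp @ W - tgt          # prediction error on every frame pair
        W -= lr * inp.T @ err / len(inp)
    return W

x = make_signal()
W = train_predictor(x)
mse_model = np.mean((x[:-1] @ W - x[1:]) ** 2)
mse_naive = np.mean((x[:-1] - x[1:]) ** 2)   # "no change" baseline
print(f"predictor MSE {mse_model:.3f} vs baseline {mse_naive:.3f}")
```

Because adjacent frames are statistically dependent, minimising prediction error forces the learned weights to capture the signal's temporal structure; in the self-supervised models the paper studies, the analogous pressure shapes the hidden representations that are then compared against neural recordings.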