What Do Self-Supervised Speech Models Know About Words? (2307.00162v3)
Abstract: Many self-supervised speech models (S3Ms) have been introduced over the last few years, improving performance and data efficiency on various speech tasks. However, these empirical successes alone do not give a complete picture of what is learned during pre-training. Recent work has begun analyzing how S3Ms encode certain properties, such as phonetic and speaker information, but we still lack a proper understanding of knowledge encoded at the word level and beyond. In this work, we use lightweight analysis methods to study segment-level linguistic properties -- word identity, boundaries, pronunciation, syntactic features, and semantic features -- encoded in S3Ms. We present a comparative study of layer-wise representations from ten S3Ms and find that (i) the frame-level representations within each word segment are not all equally informative, and (ii) the pre-training objective and model size heavily influence the accessibility and distribution of linguistic information across layers. We also find that on several tasks -- word discrimination, word segmentation, and semantic sentence similarity -- S3Ms trained with visual grounding outperform their speech-only counterparts. Finally, our task-based analyses demonstrate improved performance on word segmentation and acoustic word discrimination while using simpler methods than prior work.
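The abstract does not spell out the exact probing setup, but a rough, hypothetical illustration of the kind of lightweight, segment-level analysis it describes is sketched below: frame-level hidden states from one layer of a pre-trained S3M are mean-pooled over a word segment, and two pooled vectors are compared with cosine similarity, as in an acoustic word discrimination test. The checkpoint name, layer index, segment times, and file paths are placeholders, and the Hugging Face `transformers` wav2vec 2.0 API is used only as a stand-in for whichever S3Ms the paper actually studies; this is not the authors' code.

```python
import torch
import soundfile as sf
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Illustrative checkpoint; the paper compares ten different S3Ms.
MODEL_NAME = "facebook/wav2vec2-base"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
model = Wav2Vec2Model.from_pretrained(MODEL_NAME)
model.eval()


def word_embedding(waveform, sample_rate, start_s, end_s, layer):
    """Mean-pool one layer's frame vectors over a (forced-aligned) word segment."""
    inputs = extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    frames = out.hidden_states[layer].squeeze(0)          # shape (T, D)
    # Map word start/end times (seconds) to frame indices, assuming
    # frames are evenly spaced over the utterance duration.
    hop = (len(waveform) / sample_rate) / frames.shape[0]
    i = int(start_s / hop)
    j = max(int(end_s / hop), i + 1)
    return frames[i:j].mean(dim=0)                        # pooled word vector (D,)


# Hypothetical word segments from two utterances (placeholder paths and times).
wav_a, sr_a = sf.read("utterance_a.wav")
wav_b, sr_b = sf.read("utterance_b.wav")
emb_a = word_embedding(wav_a, sr_a, start_s=0.52, end_s=0.91, layer=8)
emb_b = word_embedding(wav_b, sr_b, start_s=1.10, end_s=1.47, layer=8)

# Higher similarity -> more likely the same word type (word discrimination).
score = torch.cosine_similarity(emb_a, emb_b, dim=0)
print(f"cosine similarity: {score.item():.3f}")
```

Repeating this pooling for every layer and word segment gives the layer-wise word-level representations that analyses such as the ones the abstract lists (word identity, boundaries, syntactic and semantic features) can then probe.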