What Do Self-Supervised Speech Models Know About Words? (2307.00162v3)

Published 30 Jun 2023 in cs.CL, cs.LG, and eess.AS

Abstract: Many self-supervised speech models (S3Ms) have been introduced over the last few years, improving performance and data efficiency on various speech tasks. However, these empirical successes alone do not give a complete picture of what is learned during pre-training. Recent work has begun analyzing how S3Ms encode certain properties, such as phonetic and speaker information, but we still lack a proper understanding of knowledge encoded at the word level and beyond. In this work, we use lightweight analysis methods to study segment-level linguistic properties -- word identity, boundaries, pronunciation, syntactic features, and semantic features -- encoded in S3Ms. We present a comparative study of layer-wise representations from ten S3Ms and find that (i) the frame-level representations within each word segment are not all equally informative, and (ii) the pre-training objective and model size heavily influence the accessibility and distribution of linguistic information across layers. We also find that on several tasks -- word discrimination, word segmentation, and semantic sentence similarity -- S3Ms trained with visual grounding outperform their speech-only counterparts. Finally, our task-based analyses demonstrate improved performance on word segmentation and acoustic word discrimination while using simpler methods than prior work.
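Below is a minimal sketch (not the authors' code) of the kind of lightweight, segment-level analysis the abstract describes: extracting layer-wise frame representations from a pre-trained S3M and mean-pooling the frames inside a word segment to obtain one vector per word per layer. It assumes torchaudio's pre-trained wav2vec 2.0 bundle and a hypothetical audio file; in practice word boundaries would come from a forced aligner such as the Montreal Forced Aligner rather than being hard-coded.

```python
# Sketch: layer-wise word-segment representations from a self-supervised speech model.
# Assumptions: torchaudio's WAV2VEC2_BASE bundle, a local file "utterance.wav",
# and externally supplied word boundaries (e.g., from a forced aligner).
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("utterance.wav")  # hypothetical input file
if sr != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    # One tensor per transformer layer, each of shape (batch, num_frames, dim).
    layer_feats, _ = model.extract_features(waveform)

FRAME_SEC = 0.02  # wav2vec 2.0 produces one frame roughly every 20 ms


def word_vector(features: torch.Tensor, start_s: float, end_s: float) -> torch.Tensor:
    """Mean-pool the frame vectors that fall inside one word segment."""
    lo = int(start_s / FRAME_SEC)
    hi = max(lo + 1, int(end_s / FRAME_SEC))
    return features[0, lo:hi].mean(dim=0)


# Example: pooled representation of a word spanning 0.53-0.91 s, at every layer.
word_vecs_per_layer = [word_vector(f, 0.53, 0.91) for f in layer_feats]
```

Probing these per-layer word vectors (e.g., with a linear classifier for word identity or syntactic features, or a canonical correlation analysis against text word embeddings) is one plausible way to compare how accessible linguistic information is across layers and models, in the spirit of the comparative study summarized above.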

