
An Empirical Analysis of Speech Self-Supervised Learning at Multiple Resolutions

Published 31 Oct 2024 in eess.AS and cs.LG | arXiv:2410.23955v1

Abstract: Self-supervised learning (SSL) models have become crucial in speech processing, with recent advancements concentrating on developing architectures that capture representations across multiple timescales. The primary goal of these multi-scale architectures is to exploit the hierarchical nature of speech, where lower-resolution components aim to capture representations that align with increasingly abstract concepts (e.g., from phones to words to sentences). Although multi-scale approaches have demonstrated some improvements over single-scale models, the precise reasons for these improvements remain poorly supported by empirical evidence. In this study, we present an initial analysis of layer-wise representations in multi-scale architectures, with a focus on Canonical Correlation Analysis (CCA) and Mutual Information (MI). We apply this analysis to Multi-Resolution HuBERT (MR-HuBERT) and find that (1) the improved performance on SUPERB tasks is primarily due to the auxiliary low-resolution loss rather than the downsampling itself, and (2) downsampling to lower resolutions neither improves downstream performance nor correlates with higher-level information (e.g., words), though it does improve computational efficiency. These findings challenge assumptions about the multi-scale nature of MR-HuBERT and motivate the importance of disentangling computational efficiency from learning better representations.
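The layer-wise CCA analysis described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, array shapes, and the use of mean canonical correlation as a similarity score are assumptions. The idea is that, given two time-aligned feature matrices (e.g., hidden states from two model layers, or layer states paired with word embeddings), the canonical correlations measure how well the two representation spaces align.

```python
import numpy as np

def cca_similarity(X, Y):
    """Mean canonical correlation between two views of the same frames.

    X: (n_frames, d1) and Y: (n_frames, d2) are time-aligned feature
    matrices, e.g. hidden states from two layers, or layer states
    paired with word-level embeddings.
    """
    # Center each view
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of each view
    qx, _ = np.linalg.qr(X)
    qy, _ = np.linalg.qr(Y)
    # Singular values of qx^T qy are the cosines of the principal
    # angles between the two subspaces, i.e. the canonical correlations
    s = np.linalg.svd(qx.T @ qy, compute_uv=False)
    s = np.clip(s, 0.0, 1.0)  # guard against numerical overshoot
    return float(s.mean())
```

A score near 1 indicates the two layers encode linearly equivalent information; invertible linear transforms of the features leave the score unchanged, which is what makes CCA useful for comparing layers with different dimensionalities.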

