
Revisiting Self-supervised Learning of Speech Representation from a Mutual Information Perspective (2401.08833v1)

Published 16 Jan 2024 in eess.AS, cs.CL, and cs.SD

Abstract: Existing studies on self-supervised speech representation learning have focused on developing new training methods and on applying pre-trained models to different downstream applications. However, the quality of these models is usually measured by their performance on downstream tasks, and how readily the representations expose the information of interest is less studied. In this work, we take a closer look at existing self-supervised speech methods from an information-theoretic perspective. We aim to develop mutual-information-based metrics that help with practical problems such as model design and selection. We use linear probes to estimate the mutual information between the target information and the learned representations, offering another view of how accessible the target information is from speech representations. Further, we explore evaluating representations in a self-supervised fashion by estimating the mutual information between different parts of the data without using any labels. Finally, we show that both the supervised and the unsupervised measures echo the models' performance on layer-wise linear probing and on speech recognition.
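
The probe-based measure described in the abstract can be illustrated with a small sketch. Assuming discrete target labels Y (e.g., phone identities obtained from forced alignment) and frozen representations Z from one layer of a pre-trained model, a linear probe's held-out cross-entropy yields the lower bound I(Y; Z) >= H(Y) - CE. The code below is an illustrative assumption in PyTorch, not the authors' released implementation; all function and variable names are hypothetical.

```python
# Illustrative sketch (not the paper's code): lower-bound I(Y; Z) between
# frozen speech representations Z and labels Y with a linear probe, using
# I(Y; Z) >= H(Y) - CE(probe), where CE is the probe's held-out cross-entropy
# measured in nats.
import torch
import torch.nn as nn


def probe_mi_lower_bound(train_z, train_y, test_z, test_y, num_classes,
                         epochs=20, lr=1e-2):
    """train_z/test_z: (N, D) float features; train_y/test_y: (N,) integer labels."""
    probe = nn.Linear(train_z.shape[1], num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    # Fit the linear probe on the frozen features (full-batch for simplicity).
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(train_z), train_y)
        loss.backward()
        opt.step()

    # Empirical label entropy H(Y) on the held-out set, in nats.
    counts = torch.bincount(test_y, minlength=num_classes).float()
    p = counts / counts.sum()
    h_y = -(p[p > 0] * p[p > 0].log()).sum().item()

    # Held-out cross-entropy of the probe, in nats.
    with torch.no_grad():
        ce = loss_fn(probe(test_z), test_y).item()

    # Estimated lower bound on I(Y; Z); clip at zero for badly fit probes.
    return max(h_y - ce, 0.0)


# Hypothetical usage: z_* are layer activations, y_* are frame-level phone labels.
# mi_nats = probe_mi_lower_bound(z_train, y_train, z_test, y_test, num_classes=40)
```

Under these assumptions, repeating the estimate layer by layer gives a curve that can be compared against downstream linear-probing or ASR results, which is the kind of comparison the abstract describes.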
