
Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations (2305.08099v3)

Published 14 May 2023 in cs.SD, cs.CL, cs.LG, and eess.AS

Abstract: Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition (ASR) and proved to be extremely useful in low label-resource settings. However, the success of SSL models has yet to transfer to utterance-level tasks such as speaker, emotion, and language recognition, which still require supervised fine-tuning of the SSL models to obtain good performance. We argue that the problem is caused by the lack of disentangled representations and an utterance-level learning objective for these tasks. Inspired by how HuBERT uses clustering to discover hidden acoustic units, we formulate a factor analysis (FA) model that uses the discovered hidden acoustic units to align the SSL features. The underlying utterance-level representations are disentangled from the content of speech using probabilistic inference on the aligned features. Furthermore, the variational lower bound derived from the FA model provides an utterance-level objective, allowing error gradients to be backpropagated to the Transformer layers to learn highly discriminative acoustic units. When used in conjunction with HuBERT's masked prediction training, our models outperform the current best model, WavLM, on all utterance-level non-semantic tasks on the SUPERB benchmark with only 20% of labeled data.
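
The factor-analysis step the abstract describes — soft-aligning SSL frame features to discovered acoustic units and inferring an utterance-level latent vector by probabilistic inference — can be illustrated with an i-vector-style posterior computation. The sketch below is not the authors' implementation: the function name, array shapes, and the identity residual covariance are all simplifying assumptions for illustration.

```python
import numpy as np

def infer_utterance_vector(X, gamma, mu, T):
    """Posterior mean of an utterance-level latent w under a simplified
    factor-analysis model x_t = mu_c + T_c @ w + eps (illustrative, not
    the paper's exact model), with frames soft-aligned to C acoustic units.

    X:     (num_frames, feat_dim)   SSL frame features (e.g. from HuBERT)
    gamma: (num_frames, C)          soft alignments to acoustic units
    mu:    (C, feat_dim)            per-unit means
    T:     (C, feat_dim, latent)    per-unit factor-loading matrices
    Assumes identity residual covariance for brevity.
    """
    C, D, Q = T.shape
    # Zeroth- and first-order (centered) statistics per acoustic unit
    N = gamma.sum(axis=0)                    # (C,) soft frame counts
    F = gamma.T @ X - N[:, None] * mu        # (C, D) centered stats
    # Gaussian posterior of w: precision = I + sum_c N_c T_c^T T_c
    precision = np.eye(Q)
    rhs = np.zeros(Q)
    for c in range(C):
        precision += N[c] * T[c].T @ T[c]
        rhs += T[c].T @ F[c]
    return np.linalg.solve(precision, rhs)   # posterior mean of w

# Toy usage with random stand-ins for frame features and unit alignments
rng = np.random.default_rng(0)
Tn, C, D, Q = 50, 4, 8, 3
X = rng.standard_normal((Tn, D))
logits = rng.standard_normal((Tn, C))
gamma = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
mu = rng.standard_normal((C, D))
T = 0.1 * rng.standard_normal((C, D, Q))
w = infer_utterance_vector(X, gamma, mu, T)
print(w.shape)  # (3,)
```

Because the posterior is Gaussian, the same quantities also yield the variational lower bound the paper uses as an utterance-level training objective; here only the posterior mean is computed.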

References (50)
  1. Spoken language biomarkers for detecting cognitive impairment. In Proc. Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 409–416, 2017.
  2. XLS-R: self-supervised cross-lingual speech representation learning at scale. In Proc. Interspeech 2022, pp. 2278–2282, 2022.
  3. vq-wav2vec: Self-supervised learning of discrete speech representations. In Proc. International Conference on Learning Representations, ICLR, 2020a.
  4. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Proc. Advances in Neural Information Processing Systems, 2020b.
  5. Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.
  6. IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Evaluation, 42(4):335–359, 2008.
  7. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process., 16(6):1505–1518, 2022.
  8. A simple framework for contrastive learning of visual representations. In Proc. International Conference on Machine Learning, ICML, volume 119 of Proceedings of Machine Learning Research, pp. 1597–1607, 2020.
  9. Generative pre-training for speech with autoregressive predictive coding. In Proc. International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 3497–3501, 2020.
  10. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):788–798, 2010.
  11. BERT: pre-training of deep bidirectional transformers for language understanding. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 4171–4186, 2019.
  12. Unsupervised visual representation learning by context prediction. In Proc. International Conference on Computer Vision, ICCV, pp. 1422–1430, 2015.
  13. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE ACM Trans. Audio Speech Lang. Process., 29:3451–3460, 2021a.
  14. Robust wav2vec 2.0: Analyzing domain shift in self-supervised pre-training. In Proc. Interspeech 2021, pp. 721–725, 2021b.
  15. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In Proc. Advances in Neural Information Processing Systems, pp. 4485–4495, 2018.
  16. Joint factor analysis versus eigenchannels in speaker recognition. IEEE Trans. Speech Audio Process., 15(4):1435–1447, 2007.
  17. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
  18. A novel scheme for speaker recognition using a phonetically-aware deep neural network. In Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1695–1699, 2014.
  19. Spoken language recognition: from fundamentals to practice. Proceedings of the IEEE, 101(5):1136–1159, 2013.
  20. Deep contextualized acoustic representations for semi-supervised speech recognition. In Proc. International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 6429–6433, 2020.
  21. Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders. In Proc. International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 6419–6423, 2020.
  22. DepAudioNet: An efficient deep model for audio based depression classification. In Proc. International Workshop on Audio/Visual Emotion Challenge, pp. 35–42, 2016.
  23. UMAP: uniform manifold approximation and projection. J. Open Source Softw., 3(29):861, 2018.
  24. Murphy, K. P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
  25. VoxCeleb: A large-scale speaker identification dataset. In Lacerda, F. (ed.), Proc. Interspeech, pp. 2616–2620, 2017.
  26. The VOiCES from a distance challenge 2019 evaluation plan. arXiv preprint arXiv:1902.10828, 2019.
  27. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  28. LibriSpeech: An ASR corpus based on public domain audio books. In Proc. International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 5206–5210, 2015.
  29. Learning problem-agnostic speech representations from multiple self-supervised tasks. In Kubin, G. and Kacic, Z. (eds.), Proc. Interspeech, pp. 161–165, 2019.
  30. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  31. Probabilistic linear discriminant analysis for inferences about identity. In Proc. International Conference on Computer Vision, pp. 1–8, 2007.
  32. AutoVC: Zero-shot voice style transfer with only autoencoder loss. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proc. International Conference on Machine Learning, ICML, volume 97, pp. 5210–5219, 2019.
  33. ContentVec: An improved self-supervised speech representation by disentangling speakers. In Proc. International Conference on Machine Learning, ICML, volume 162, pp. 18003–18017, 2022.
  34. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356, 2022.
  35. Multi-task self-supervised learning for robust speech recognition. In Proc. International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 6989–6993, 2020.
  36. SpeechBrain: A general-purpose speech toolkit. arXiv preprint arXiv:2106.04624, 2021.
  37. wav2vec: Unsupervised pre-training for speech recognition. In Kubin, G. and Kacic, Z. (eds.), Proc. Interspeech 2019, pp. 3465–3469, 2019.
  38. Sculley, D. Web-scale k-means clustering. In Proc. International Conference on World Wide Web, pp. 1177–1178, 2010.
  39. Children’s speaker verification in low and zero resource conditions. Digital Signal Processing, 116:103115, 2021.
  40. Universal paralinguistic speech representations using self-supervised conformers. In Proc. ICASSP 2022, pp. 3169–3173, 2022.
  41. Towards learning a universal non-semantic representation of speech. arXiv preprint arXiv:2002.12764, 2020.
  42. CommonLanguage, June 2021. URL https://doi.org/10.5281/zenodo.5036977.
  43. X-vectors: Robust DNN embeddings for speaker recognition. In Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333, 2018.
  44. Deep speaker verification model for low-resource languages and Vietnamese dataset. In Proc. Pacific Asia Conference on Language, Information and Computation, pp. 445–454, 2021.
  45. A survey on text-dependent and text-independent speaker verification. IEEE Access, 10:99038–99049, 2022.
  46. Comparing supervised models and learned speech representations for classifying intelligibility of disordered speech on selected phrases. arXiv preprint arXiv:2107.03985, 2021.
  47. Improving self-supervised learning for speech recognition with intermediate layer supervision. In Proc. International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 7092–7096, 2022.
  48. A fine-tuned wav2vec 2.0/HuBERT benchmark for speech emotion recognition, speaker verification and spoken language understanding. arXiv preprint arXiv:2111.02735, 2021.
  49. A comprehensive review of speech emotion recognition systems. IEEE Access, 9:47795–47814, 2021.
  50. SUPERB: speech processing universal performance benchmark. In Proc. Interspeech 2021, pp. 1194–1198, 2021.
Authors (4)
  1. Weiwei Lin (33 papers)
  2. Chenhang He (18 papers)
  3. Man-Wai Mak (15 papers)
  4. Youzhi Tu (3 papers)
Citations (5)