MASR: Multi-label Aware Speech Representation (2307.10982v2)
Abstract: In recent years, speech representation learning has primarily been framed as a self-supervised learning (SSL) task that uses the raw audio signal alone, ignoring the side information often available for a given speech recording. In this paper, we propose MASR, a Multi-label Aware Speech Representation learning framework, which addresses this limitation. MASR enables the inclusion of multiple external knowledge sources to better exploit meta-data information. The external knowledge sources are incorporated as sample-level pairwise similarity matrices that drive a hard-mining loss. A key advantage of the MASR framework is that it can be combined with any choice of SSL method. Using MASR representations, we perform evaluations on several downstream tasks, such as language identification and speech recognition, as well as non-semantic tasks such as speaker and emotion recognition. In these experiments, we demonstrate significant performance improvements of MASR over established benchmarks. We also present a detailed analysis on the language identification task to provide insights into how the proposed loss function enables the representations to separate closely related languages.
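The abstract does not spell out the exact form of the similarity matrices or the loss, so the following is only a minimal illustrative sketch (hypothetical function names, NumPy) of the general idea: derive a sample-level pairwise similarity matrix from multi-label meta-data (e.g. language, speaker attributes), then use it to pick hard positives and hard negatives in a triplet-style mining objective.

```python
import numpy as np

def metadata_similarity(labels):
    """Pairwise similarity from multi-label meta-data.

    `labels` is a list of label sets, one per sample (hypothetical
    format). Similarity is the Jaccard overlap of the two label sets.
    """
    n = len(labels)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            union = labels[i] | labels[j]
            S[i, j] = len(labels[i] & labels[j]) / len(union) if union else 0.0
    return S

def hard_mining_loss(emb, S, margin=0.5, thresh=0.5):
    """Triplet-style hard-mining loss guided by the similarity matrix.

    For each anchor, the hardest positive is the *farthest* sample whose
    meta-data similarity exceeds `thresh`; the hardest negative is the
    *closest* sample below it. Both thresholding and the margin value
    are illustrative choices, not taken from the paper.
    """
    # cosine distance between L2-normalised embeddings
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    D = 1.0 - emb @ emb.T
    n, loss = len(emb), 0.0
    for a in range(n):
        pos = [j for j in range(n) if j != a and S[a, j] >= thresh]
        neg = [j for j in range(n) if j != a and S[a, j] < thresh]
        if not pos or not neg:
            continue  # no valid triplet for this anchor
        hardest_pos = max(D[a, j] for j in pos)  # farthest similar sample
        hardest_neg = min(D[a, j] for j in neg)  # closest dissimilar sample
        loss += max(0.0, hardest_pos - hardest_neg + margin)
    return loss / n

# Toy usage: three samples tagged with (language, gender) meta-data.
labels = [{"en", "female"}, {"en", "male"}, {"hi", "female"}]
S = metadata_similarity(labels)
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
loss = hard_mining_loss(emb, S, thresh=0.3)
```

In a real SSL training loop, a term like this would be added to the base SSL objective so that the embeddings respect the meta-data-derived similarity structure.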