HuBERTopic: Enhancing Semantic Representation of HuBERT through Self-supervision Utilizing Topic Model (2310.03975v1)
Abstract: Recently, the usefulness of self-supervised representation learning (SSRL) methods has been confirmed in various downstream tasks. Many of these models, as exemplified by HuBERT and WavLM, use pseudo-labels generated from spectral features or from the model's own representations. Previous studies have shown that these pseudo-labels contain semantic information. However, the masked prediction task, which is HuBERT's learning criterion, focuses on local contextual information and may not make effective use of global semantic information such as the speaker or the theme of the speech. In this paper, we propose a new approach to enrich the semantic representation of HuBERT. We apply a topic model to the pseudo-labels to generate a topic label for each utterance, and add an auxiliary topic classification task to HuBERT using these topic labels as teacher signals. This allows additional global semantic information to be incorporated in an unsupervised manner. Experimental results demonstrate that our method achieves comparable or better performance than the baseline in most tasks, including automatic speech recognition and five out of eight SUPERB tasks. Moreover, we find that the topic labels capture various attributes of an utterance, such as gender, speaker, and theme. This highlights the effectiveness of our approach in capturing multifaceted semantic nuances.
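To make the pipeline concrete, below is a minimal sketch (not the authors' code) of how per-utterance topic labels can be derived from frame-level pseudo-labels and then used as targets for an auxiliary classifier. It assumes the pseudo-labels are k-means cluster IDs per frame and uses scikit-learn's LDA as a stand-in for the paper's topic model; the function names and the toy data are illustrative only.

```python
# Sketch: turn HuBERT-style pseudo-label sequences into per-utterance topic labels.
# Assumptions: pseudo_labels is a list of utterances, each a list of frame-level
# cluster IDs; the topic model here is sklearn LDA, used as a stand-in for the
# topic model described in the paper.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation


def utterances_to_documents(pseudo_labels):
    """Treat each utterance as a 'document' of cluster-ID tokens,
    e.g. [3, 3, 17] -> 'c3 c3 c17'."""
    return [" ".join(f"c{c}" for c in utt) for utt in pseudo_labels]


def fit_topic_labels(pseudo_labels, num_topics=20, seed=0):
    """Fit a topic model on the pseudo-label documents and return one
    hard topic label per utterance (argmax of the topic posterior)."""
    docs = utterances_to_documents(pseudo_labels)
    vectorizer = CountVectorizer(token_pattern=r"c\d+")
    counts = vectorizer.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=num_topics, random_state=seed)
    theta = lda.fit_transform(counts)   # shape: (num_utterances, num_topics)
    return theta.argmax(axis=1)         # one topic label per utterance


if __name__ == "__main__":
    # Toy usage: three utterances represented by their cluster-ID sequences.
    utts = [[3, 3, 17, 5], [3, 5, 5, 5], [42, 42, 7]]
    print(fit_topic_labels(utts, num_topics=2))
```

In training, these utterance-level topic labels would then serve as targets for an auxiliary cross-entropy classification loss added to HuBERT's masked prediction objective, so that global semantic information is injected without any manual annotation.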
- Takashi Maekaku
- Jiatong Shi
- Xuankai Chang
- Yuya Fujita
- Shinji Watanabe