A Study on the Integration of Pre-trained SSL, ASR, LM and SLU Models for Spoken Language Understanding (2211.05869v1)
Abstract: Collecting sufficient labeled data for spoken language understanding (SLU) is expensive and time-consuming. Recent studies have achieved promising results by using pre-trained models in low-resource scenarios. Inspired by this, we ask: which (if any) pre-training strategies can improve performance across SLU benchmarks? To answer this question, we employ four types of pre-trained models and their combinations for SLU. We leverage self-supervised speech and language models (LM) pre-trained on large quantities of unpaired data to extract strong speech and text representations. We also explore using supervised models pre-trained on larger external automatic speech recognition (ASR) or SLU corpora. We conduct extensive experiments on the SLU Evaluation (SLUE) benchmark and observe self-supervised pre-trained models to be more powerful than their supervised counterparts, with pre-trained LM and speech models being most beneficial for the Sentiment Analysis and Named Entity Recognition tasks, respectively.
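To make the "extract strong speech and text representations" step concrete, here is a minimal sketch, not the authors' implementation, of pulling frame-level and token-level features from off-the-shelf pre-trained models with HuggingFace Transformers. The specific checkpoints (wav2vec 2.0 and BERT) and the example utterance are illustrative assumptions; the paper studies several such model types and combinations.

```python
# A minimal sketch (illustrative, not the paper's exact setup) of extracting
# speech and text representations from pre-trained SSL and LM checkpoints.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from transformers import BertTokenizer, BertModel

# Self-supervised speech model (SSL); checkpoint choice is an assumption.
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
speech_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

# Pre-trained language model (LM); checkpoint choice is an assumption.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_model = BertModel.from_pretrained("bert-base-uncased")

# Dummy 1-second, 16 kHz waveform standing in for an SLU utterance.
waveform = torch.randn(16000)

with torch.no_grad():
    # Frame-level speech representations: shape (1, num_frames, hidden_dim).
    speech_inputs = feature_extractor(
        waveform.numpy(), sampling_rate=16000, return_tensors="pt"
    )
    speech_repr = speech_model(**speech_inputs).last_hidden_state

    # Token-level text representations (e.g., over an ASR transcript):
    # shape (1, num_tokens, hidden_dim).
    text_inputs = tokenizer("book a flight to denver", return_tensors="pt")
    text_repr = text_model(**text_inputs).last_hidden_state

# Downstream SLU heads (e.g., NER taggers or sentiment classifiers)
# would be trained on top of these representations.
print(speech_repr.shape, text_repr.shape)
```

In this framing, the paper's comparison amounts to swapping which pre-trained encoders feed the downstream SLU head: self-supervised speech/LM encoders as above, versus encoders first fine-tuned on external supervised ASR or SLU corpora.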
- Yifan Peng
- Siddhant Arora
- Yosuke Higuchi
- Yushi Ueda
- Sujay Kumar
- Karthik Ganesan
- Siddharth Dalmia
- Xuankai Chang
- Shinji Watanabe