Introducing Semantics into Speech Encoders (2211.08402v1)
Abstract: Recent studies find that existing self-supervised speech encoders contain primarily acoustic rather than semantic information. As a result, pipelined systems that feed supervised automatic speech recognition (ASR) output into a large language model (LLM) achieve state-of-the-art results on semantic spoken language tasks by leveraging the LLM's rich semantic representations. These systems come at the cost of labeled audio transcriptions, which are expensive and time-consuming to obtain. We propose a task-agnostic, unsupervised way of incorporating semantic information from LLMs into self-supervised speech encoders without labeled audio transcriptions. By introducing semantics, we improve the spoken language understanding performance of existing speech encoders by over 10% on intent classification, with modest gains in named entity resolution and slot filling, and raise the spoken question answering FF1 score by over 2%. Our unsupervised approach achieves performance similar to supervised methods trained on over 100 hours of labeled audio transcripts, demonstrating the feasibility of unsupervised semantic augmentations to existing speech encoders.
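To make the idea of injecting LLM semantics into a speech encoder concrete, below is a minimal, hypothetical sketch of one way such unsupervised semantic distillation could look. It is not the authors' implementation: it assumes utterance-level features from a frozen speech encoder and LLM embeddings of pseudo-transcripts (e.g. from an unsupervised ASR system) are already available as tensors, and all module and variable names are placeholders.

```python
# Hypothetical sketch (not the paper's implementation): distill semantic
# representations from a text LLM into a speech-encoder pipeline without
# any labeled transcripts. Pairs of (speech features, LLM embeddings) are
# assumed to come from pseudo-transcripts produced by unsupervised ASR.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticProjection(nn.Module):
    """Maps frozen speech-encoder features into the LLM embedding space."""
    def __init__(self, speech_dim: int = 768, llm_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(speech_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, time, speech_dim) -> utterance-level pooling
        pooled = speech_feats.mean(dim=1)
        return self.proj(pooled)

def distillation_loss(speech_emb: torch.Tensor, llm_emb: torch.Tensor) -> torch.Tensor:
    """Cosine-distance loss pulling speech embeddings toward LLM embeddings."""
    return 1.0 - F.cosine_similarity(speech_emb, llm_emb, dim=-1).mean()

# Toy training step with random tensors standing in for real features.
speech_feats = torch.randn(8, 200, 768)   # frozen speech-encoder outputs
llm_emb = torch.randn(8, 1024)            # LLM embeddings of pseudo-transcripts
model = SemanticProjection()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

loss = distillation_loss(model(speech_feats), llm_emb)
loss.backward()
optimizer.step()
```

The key design point this sketch illustrates is that the supervision signal comes from the LLM's embedding space rather than from human transcripts, which is what makes the augmentation unsupervised and task-agnostic.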
- Derek Xu (10 papers)
- Shuyan Dong (7 papers)
- Changhan Wang (46 papers)
- Suyoun Kim (22 papers)
- Zhaojiang Lin (45 papers)
- Akshat Shrivastava (25 papers)
- Shang-Wen Li (55 papers)
- Liang-Hsuan Tseng (9 papers)
- Alexei Baevski (39 papers)
- Guan-Ting Lin (21 papers)
- Hung-yi Lee (327 papers)
- Yizhou Sun (149 papers)
- Wei Wang (1793 papers)