DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding (2406.09345v1)

Published 13 Jun 2024 in cs.CL, cs.SD, and eess.AS

Abstract: The integration of pre-trained text-based large language models (LLMs) with speech input has enabled instruction-following capabilities for diverse speech tasks. This integration requires the use of a speech encoder, a speech adapter, and an LLM, trained on diverse tasks. We propose the use of discrete speech units (DSU), rather than continuous-valued speech encoder outputs, which are converted into the LLM token embedding space by the speech adapter. We generate DSU using a self-supervised speech encoder followed by k-means clustering. The proposed model shows robust performance on speech inputs from seen and unseen domains, as well as instruction-following capability in spoken question answering. We also explore various types of DSU extracted from different layers of the self-supervised speech encoder, as well as Mel-frequency cepstral coefficients (MFCCs). Our findings suggest that the ASR task and datasets are not crucial for instruction tuning on spoken question answering tasks.
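The pipeline described in the abstract (self-supervised encoder → k-means clustering → discrete speech units → speech adapter → LLM token embedding space) can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it assumes a HuggingFace HuBERT checkpoint as the self-supervised encoder, and the layer index, cluster count, and LLM hidden size are placeholder assumptions rather than the paper's reported configuration.

```python
import torch
from sklearn.cluster import MiniBatchKMeans
from transformers import AutoFeatureExtractor, HubertModel

ENCODER = "facebook/hubert-base-ls960"  # assumed SSL encoder checkpoint
LAYER = 9        # assumed intermediate layer for feature extraction
N_UNITS = 500    # assumed k-means cluster count (DSU vocabulary size)
LLM_DIM = 4096   # assumed LLM token-embedding dimension

feature_extractor = AutoFeatureExtractor.from_pretrained(ENCODER)
encoder = HubertModel.from_pretrained(ENCODER).eval()

def ssl_features(waveform: torch.Tensor, sr: int = 16000) -> torch.Tensor:
    """Return frame-level features from one intermediate encoder layer."""
    inputs = feature_extractor(waveform.numpy(), sampling_rate=sr,
                               return_tensors="pt")
    with torch.no_grad():
        out = encoder(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER].squeeze(0)  # (frames, feature_dim)

# 1) Fit k-means on SSL features. A real setup would pool features from
#    many utterances; 20 seconds of random noise stands in for audio here.
wave = torch.randn(20 * 16000)
feats = ssl_features(wave)
kmeans = MiniBatchKMeans(n_clusters=N_UNITS).fit(feats.numpy())

# 2) Discretize: each frame becomes a cluster ID, i.e., a discrete speech unit.
dsu = torch.tensor(kmeans.predict(feats.numpy()), dtype=torch.long)  # (frames,)

# 3) Speech adapter (hypothetical minimal form): embed DSU IDs into the
#    LLM token-embedding space so they can be interleaved with text tokens.
adapter = torch.nn.Embedding(N_UNITS, LLM_DIM)
speech_token_embeddings = adapter(dsu)  # (frames, LLM_DIM)
```

In the full system, the adapter output would be combined with instruction-token embeddings and fed to the LLM; DSU pipelines also commonly deduplicate consecutive repeated units, which is omitted here for brevity.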

Authors (6)
  1. Suwon Shon
  2. Kwangyoun Kim
  3. Yi-Te Hsu
  4. Prashant Sridhar
  5. Shinji Watanabe
  6. Karen Livescu
