Speech-to-Text Adapter and Speech-to-Entity Retriever Augmented LLMs for Speech Understanding (2306.07944v1)

Published 8 Jun 2023 in eess.AS, cs.AI, and cs.CL

Abstract: LLMs have been applied in the speech domain, often incurring a performance drop due to misalignment between speech and language representations. To bridge this gap, we propose a joint speech and LLM (SLM) using a Speech2Text adapter, which maps speech into the text token embedding space without loss of speech information. Additionally, using CTC-based blank-filtering, we can reduce the speech sequence length to that of text. On the speech MultiWoz dataset (DSTC11 challenge), SLM largely improves dialog state tracking (DST) performance (24.7% to 28.4% accuracy). Further, to address errors on rare entities, we augment SLM with a Speech2Entity retriever, which uses speech to retrieve relevant entities and then adds them to the original SLM input as a prefix. With this retrieval-augmented SLM (ReSLM), DST performance jumps to 34.6% accuracy. Moreover, augmenting the ASR task with the dialog understanding task improves ASR performance from 9.4% to 8.5% WER.
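The abstract's core mechanism is CTC-based blank-filtering followed by a projection of the surviving speech frames into the LLM's text token embedding space. Below is a minimal sketch of that idea, not the authors' implementation: `ctc_blank_filter`, `Speech2TextAdapter`, and all dimensions are hypothetical placeholders chosen for illustration.

```python
import torch


def ctc_blank_filter(encoder_frames: torch.Tensor,
                     ctc_logits: torch.Tensor,
                     blank_id: int = 0) -> torch.Tensor:
    """Keep only frames whose CTC argmax is a non-blank symbol.

    encoder_frames: (T, speech_dim) speech encoder outputs
    ctc_logits:     (T, vocab_size) per-frame CTC logits
    Returns a shortened sequence, roughly text length.
    """
    keep = ctc_logits.argmax(dim=-1) != blank_id
    return encoder_frames[keep]


class Speech2TextAdapter(torch.nn.Module):
    """Hypothetical adapter: projects blank-filtered speech frames into
    the text token embedding space consumed by the LLM."""

    def __init__(self, speech_dim: int = 512, text_embed_dim: int = 768):
        super().__init__()
        self.proj = torch.nn.Linear(speech_dim, text_embed_dim)

    def forward(self, encoder_frames: torch.Tensor,
                ctc_logits: torch.Tensor,
                blank_id: int = 0) -> torch.Tensor:
        filtered = ctc_blank_filter(encoder_frames, ctc_logits, blank_id)
        return self.proj(filtered)  # (T', text_embed_dim)
```

Under this sketch, the projected speech embeddings would be concatenated with retrieved-entity text embeddings (the ReSLM prefix) before being fed to the LLM; the paper itself should be consulted for the actual architecture and training details.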

Authors (7)
  1. Mingqiu Wang (20 papers)
  2. Izhak Shafran (30 papers)
  3. Hagen Soltau (19 papers)
  4. Wei Han (202 papers)
  5. Yuan Cao (201 papers)
  6. Dian Yu (78 papers)
  7. Laurent El Shafey (15 papers)
Citations (7)