
Prompting Large Language Models with Speech Recognition Abilities (2307.11795v1)

Published 21 Jul 2023 in eess.AS, cs.AI, cs.CL, and cs.LG

Abstract: LLMs have proven themselves highly flexible, able to solve a wide range of generative tasks, such as abstractive summarization and open-ended question answering. In this paper we extend the capabilities of LLMs by directly attaching a small audio encoder, allowing them to perform speech recognition. By directly prepending a sequence of audial embeddings to the text token embeddings, the LLM can be converted to an automatic speech recognition (ASR) system and be used in exactly the same manner as its textual counterpart. Experiments on Multilingual LibriSpeech (MLS) show that incorporating a conformer encoder into the open-sourced LLaMA-7B allows it to outperform monolingual baselines by 18% and perform multilingual speech recognition, despite LLaMA being trained overwhelmingly on English text. Furthermore, we perform ablation studies to investigate whether the LLM can be completely frozen during training to maintain its original capabilities, the effect of scaling up the audio encoder, and the effect of increasing the audio encoder striding to generate fewer embeddings. The results from these studies show that multilingual ASR is possible even when the LLM is frozen or when strides of almost 1 second are used in the audio encoder, opening up the possibility for LLMs to operate on long-form audio.
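
The prepending and striding ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`stack_frames`, `project`, `build_llm_input`), the frame-stacking approach to striding, the linear projection, and all dimensions are assumptions chosen for illustration, and plain Python lists stand in for tensors.

```python
def stack_frames(frames, factor):
    """Reduce the sequence length by concatenating every `factor`
    consecutive frames into one longer vector (hypothetical striding
    scheme). A ragged tail shorter than `factor` is dropped."""
    usable = len(frames) - len(frames) % factor
    return [
        [x for frame in frames[i:i + factor] for x in frame]
        for i in range(0, usable, factor)
    ]


def project(vectors, weight):
    """Apply a linear map so each vector matches the LLM embedding size.
    `weight` is a nested list of shape [input_dim][output_dim]."""
    out_dim = len(weight[0])
    return [
        [sum(v[k] * weight[k][j] for k in range(len(v))) for j in range(out_dim)]
        for v in vectors
    ]


def build_llm_input(audio_embeddings, text_embeddings):
    """Prepend the audio embeddings to the text token embeddings so a
    decoder-only model would see one continuous sequence."""
    dims = {len(v) for v in audio_embeddings + text_embeddings}
    if len(dims) > 1:
        raise ValueError("audio and text embeddings must share one dimension")
    return audio_embeddings + text_embeddings
```

Under the assumption that an encoder emits one frame every 10 ms, stacking with `factor=96` would yield one embedding per 960 ms of audio, so 10 seconds of speech (1000 frames) would shrink to 10 embeddings before being prepended to the text tokens. Fewer audio embeddings leave more of the LLM's context window for longer recordings, which is the motivation the abstract gives for larger strides.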

Authors (12)
  1. Yassir Fathullah (16 papers)
  2. Chunyang Wu (24 papers)
  3. Egor Lakomkin (19 papers)
  4. Junteng Jia (23 papers)
  5. Yuan Shangguan (25 papers)
  6. Ke Li (722 papers)
  7. Jinxi Guo (15 papers)
  8. Wenhan Xiong (47 papers)
  9. Jay Mahadeokar (36 papers)
  10. Ozlem Kalinli (49 papers)
  11. Christian Fuegen (36 papers)
  12. Mike Seltzer (12 papers)
Citations (104)