AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs (2311.06753v2)

Published 12 Nov 2023 in cs.CL and cs.AI

Abstract: In this work, we extend the instruction-tuned Llama-2 model with end-to-end general-purpose speech processing and reasoning abilities while maintaining the wide range of original LLM capabilities, without using any carefully curated paired data. The resulting end-to-end model, named AudioChatLlama, can utilize audio prompts as a replacement for text and sustain a conversation. Such a model also has extended cross-modal capabilities such as being able to perform spoken question answering (QA), speech translation, and audio summarization amongst many other closed and open-domain tasks. This is unlike prior approaches in speech, in which LLMs are extended to handle audio for a limited number of pre-designated tasks. On both synthesized and recorded speech QA test sets, evaluations show that our end-to-end approach is on par with or outperforms cascaded systems (speech recognizer + LLM) in terms of modeling the response to a prompt. Furthermore, unlike cascades, our approach can interchange text and audio modalities and intrinsically utilize prior context in a conversation to provide better results.

Authors (10)
  1. Yassir Fathullah (16 papers)
  2. Chunyang Wu (24 papers)
  3. Egor Lakomkin (19 papers)
  4. Junteng Jia (23 papers)
  5. Yuan Shangguan (25 papers)
  6. Jay Mahadeokar (36 papers)
  7. Ozlem Kalinli (49 papers)
  8. Christian Fuegen (36 papers)
  9. Mike Seltzer (12 papers)
  10. Ke Li (722 papers)
Citations (21)