On decoder-only architecture for speech-to-text and large language model integration (2307.03917v3)

Published 8 Jul 2023 in eess.AS, cs.CL, and cs.SD

Abstract: LLMs have achieved remarkable success in the field of natural language processing, enabling better human-computer interaction using natural language. However, the seamless integration of speech signals into LLMs has not been well explored. The "decoder-only" architecture has also not been well studied for speech processing tasks. In this research, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based LLMs. Our method leverages Connectionist Temporal Classification and a simple audio encoder to map the compressed acoustic features into the continuous semantic space of the LLM. In addition, we further probe the decoder-only architecture for speech-to-text tasks by training a smaller-scale, randomly initialized Speech-LLaMA model from paired speech-text data alone. We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.
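
The abstract describes a pipeline in which a CTC module compresses the acoustic frame sequence, a simple audio encoder projects the compressed features into the LLM's embedding space, and the result is consumed by a decoder-only LLM alongside the text prompt. The following is a minimal PyTorch sketch of that idea, under stated assumptions: "CTC compression" is read here as dropping frames whose CTC argmax is the blank token, and the class names (CTCCompressor, SpeechLLaMASketch) and the Hugging Face-style inputs_embeds interface are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn


class CTCCompressor(nn.Module):
    """Shortens the acoustic sequence by discarding frames whose CTC
    argmax is the blank label (an assumed reading of the paper's
    CTC-based compression)."""

    def __init__(self, feat_dim: int, vocab_size: int, blank_id: int = 0):
        super().__init__()
        self.ctc_head = nn.Linear(feat_dim, vocab_size)
        self.blank_id = blank_id

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, feat_dim) acoustic features for one utterance
        labels = self.ctc_head(feats).argmax(dim=-1)  # (T,) frame labels
        keep = labels != self.blank_id                # mask out blank frames
        return feats[keep]                            # (T', feat_dim), T' <= T


class SpeechLLaMASketch(nn.Module):
    """Maps compressed acoustic features into the LLM embedding space and
    prepends them to the text token embeddings of a decoder-only LLM."""

    def __init__(self, feat_dim: int, vocab_size: int, llm_dim: int, llm: nn.Module):
        super().__init__()
        self.compressor = CTCCompressor(feat_dim, vocab_size)
        # The "simple audio encoder": a small projection into the LLM space.
        self.audio_encoder = nn.Sequential(
            nn.Linear(feat_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm  # decoder-only LLM accepting precomputed embeddings

    def forward(self, feats: torch.Tensor, text_embeds: torch.Tensor):
        # feats: (T, feat_dim); text_embeds: (L, llm_dim)
        audio = self.audio_encoder(self.compressor(feats))       # (T', llm_dim)
        inputs = torch.cat([audio, text_embeds], dim=0)          # audio first
        # Assumes a Hugging Face-style causal LM that takes inputs_embeds.
        return self.llm(inputs_embeds=inputs.unsqueeze(0))
```

In this sketch the audio segment acts as a soft prompt: the decoder-only LLM attends to the projected acoustic "tokens" exactly as it would to text, which is what lets a single autoregressive model perform speech-to-text translation without a separate cross-attention encoder-decoder.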

Authors (11)
  1. Jian Wu (314 papers)
  2. Yashesh Gaur (43 papers)
  3. Zhuo Chen (319 papers)
  4. Long Zhou (57 papers)
  5. Yimeng Zhu (4 papers)
  6. Tianrui Wang (23 papers)
  7. Jinyu Li (164 papers)
  8. Shujie Liu (101 papers)
  9. Bo Ren (60 papers)
  10. Linquan Liu (8 papers)
  11. Yu Wu (196 papers)
Citations (103)