On decoder-only architecture for speech-to-text and large language model integration (2307.03917v3)
Abstract: Large language models (LLMs) have achieved remarkable success in natural language processing, enabling better human-computer interaction through natural language. However, the seamless integration of speech signals into LLMs has not been well explored, and the decoder-only architecture remains understudied for speech processing tasks. In this research, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based LLMs. Our method leverages Connectionist Temporal Classification (CTC) and a simple audio encoder to map compressed acoustic features into the continuous semantic space of the LLM. In addition, we further probe the decoder-only architecture for speech-to-text tasks by training a smaller-scale, randomly initialized Speech-LLaMA model from speech-text paired data alone. Experiments on multilingual speech-to-text translation tasks demonstrate a significant improvement over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.
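To make the architecture concrete, below is a minimal PyTorch sketch of the kind of CTC-guided bridge the abstract describes: a frame-level acoustic model whose CTC posteriors are used to compress the feature sequence (by dropping blank-dominated frames), followed by a simple audio encoder that projects the surviving frames into the LLM's embedding space. All module names, dimensions, and the blank-removal heuristic are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class AudioToLLMBridge(nn.Module):
    """Hypothetical sketch of a CTC-compressed audio-to-LLM bridge.

    Assumptions (not from the paper's code): GRU acoustic model, greedy
    blank-removal as the compression rule, and a two-layer MLP projector.
    """

    def __init__(self, feat_dim=80, hidden_dim=512, llm_dim=4096, vocab_size=5000):
        super().__init__()
        # Frame-level acoustic model whose CTC posteriors guide compression.
        self.ctc_encoder = nn.GRU(feat_dim, hidden_dim, num_layers=2, batch_first=True)
        self.ctc_head = nn.Linear(hidden_dim, vocab_size + 1)  # +1 for the CTC blank
        # Simple audio encoder mapping compressed features into the LLM space.
        self.audio_encoder = nn.Sequential(
            nn.Linear(hidden_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats):  # feats: (B, T, feat_dim)
        hidden, _ = self.ctc_encoder(feats)                      # (B, T, hidden_dim)
        log_probs = self.ctc_head(hidden).log_softmax(dim=-1)    # (B, T, vocab+1)
        # Keep only frames whose greedy CTC label is not blank (blank = last id),
        # i.e., a crude blank-removal compression of the acoustic sequence.
        keep = log_probs.argmax(dim=-1) != log_probs.size(-1) - 1  # (B, T) bool
        compressed = [h[m] for h, m in zip(hidden, keep)]          # ragged per utterance
        # Project each compressed sequence into the LLM embedding space.
        return [self.audio_encoder(c) for c in compressed]

bridge = AudioToLLMBridge()
feats = torch.randn(2, 300, 80)   # two utterances, 300 frames of 80-dim features
audio_embeds = bridge(feats)      # list of (T', llm_dim) tensors, T' << 300
```

In a decoder-only setup like the one the abstract describes, each projected audio sequence would be prepended to the text-token embeddings so the LLM attends over acoustic and textual context in a single left-to-right pass; the details of prompting and training objectives are in the paper itself.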
- Jian Wu
- Yashesh Gaur
- Zhuo Chen
- Long Zhou
- Yimeng Zhu
- Tianrui Wang
- Jinyu Li
- Shujie Liu
- Bo Ren
- Linquan Liu
- Yu Wu