A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR (2406.17272v1)
Abstract: Recent works have shown promising results in connecting speech encoders to LLMs for speech recognition. However, several limitations persist, including limited fine-tuning options, a lack of mechanisms to enforce speech-text alignment, and high insertion error rates, especially under domain-mismatch conditions. This paper presents a comprehensive solution to address these issues. We begin by investigating more thoughtful fine-tuning schemes. Next, we propose a matching loss to enhance alignment between modalities. Finally, we explore training and inference methods to mitigate high insertion errors. Experimental results on the LibriSpeech corpus demonstrate that partially fine-tuning the encoder and LLM using parameter-efficient methods, such as LoRA, is the most cost-effective approach. Additionally, the matching loss improves modality alignment, enhancing performance. The proposed training and inference methods significantly reduce insertion errors.
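The abstract describes a matching loss that pulls the speech encoder's representations toward the LLM's text representations. The paper's exact formulation is not given here, so the following is only a minimal sketch under an assumed form: project speech frames, mean-pool both modalities, and penalize the squared L2 distance between the pooled vectors. The function names (`mean_pool`, `matching_loss`) are illustrative, not from the paper.

```python
# Hypothetical sketch of a speech-text matching loss; the paper's
# actual formulation may differ (e.g., frame-level or contrastive).

def mean_pool(vectors):
    """Average a non-empty list of equal-length feature vectors."""
    dim = len(vectors[0])
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

def matching_loss(speech_frames, text_embeddings):
    """Squared L2 distance between pooled speech and text representations.

    speech_frames: list of per-frame speech feature vectors (already
        projected into the LLM embedding space).
    text_embeddings: list of per-token text embedding vectors.
    """
    s = mean_pool(speech_frames)
    t = mean_pool(text_embeddings)
    return sum((a - b) ** 2 for a, b in zip(s, t))
```

In practice such a loss would be added to the ASR training objective with a weighting coefficient, encouraging the projected speech features to land near the corresponding text embeddings.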
- Van Tung Pham
- Yist Lin
- Tao Han
- Wei Li
- Jun Zhang
- Lu Lu
- Yuxuan Wang