Connecting Speech Encoder and Large Language Model for ASR (2309.13963v2)
Abstract: The impressive capability and versatility of large language models (LLMs) have attracted increasing attention in automatic speech recognition (ASR), with several pioneering studies attempting to build integrated ASR models by connecting a speech encoder with an LLM. This paper presents a comparative study of three commonly used connector structures: fully connected layers, multi-head cross-attention, and Q-Former. Speech encoders from the Whisper model series and LLMs from the Vicuna model series of different sizes were studied. Experiments were performed on the widely used LibriSpeech, Common Voice, and GigaSpeech datasets, where LLMs with Q-Former connectors demonstrated consistent and considerable word error rate (WER) reductions over LLMs with the other connector structures. Q-Former-based LLMs also generalise well to out-of-domain data, achieving a 12% relative WER reduction over the Whisper baseline ASR model on the Eval2000 test set without using any in-domain training data from Switchboard. Moreover, a novel segment-level Q-Former is proposed to enable LLMs to recognise speech segments whose duration exceeds the input limit of the encoders, yielding a 17% relative WER reduction over the other connector structures on 90-second-long speech data.
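The connector's job is to map the variable-length output of the speech encoder onto a fixed set of embeddings that the LLM can consume alongside its text tokens. The sketch below is a minimal, hypothetical PyTorch illustration of a Q-Former-style connector and of the segment-level idea described in the abstract; the dimensions (a 1280-d Whisper-style encoder, 4096-d Vicuna-style embeddings), the single transformer block, the 64 queries, and the names `QFormerConnector` and `segment_level_connect` are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch (not the authors' implementation) of a Q-Former-style connector
# that turns speech-encoder outputs into a fixed number of LLM input embeddings.
# All sizes and the single-block depth are illustrative assumptions.
import torch
import torch.nn as nn


class QFormerConnector(nn.Module):
    def __init__(self, enc_dim=1280, llm_dim=4096, num_queries=64, num_heads=8):
        super().__init__()
        # Learned query embeddings that "read" the speech encoder output.
        self.queries = nn.Parameter(torch.randn(num_queries, enc_dim) * 0.02)
        self.self_attn = nn.MultiheadAttention(enc_dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(enc_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(enc_dim, 4 * enc_dim), nn.GELU(), nn.Linear(4 * enc_dim, enc_dim)
        )
        self.norm1 = nn.LayerNorm(enc_dim)
        self.norm2 = nn.LayerNorm(enc_dim)
        self.norm3 = nn.LayerNorm(enc_dim)
        # Project to the LLM embedding size so the outputs can be prepended
        # to the embedded text prompt of the LLM.
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, enc_out):  # enc_out: (batch, T, enc_dim)
        q = self.queries.unsqueeze(0).expand(enc_out.size(0), -1, -1)
        q = self.norm1(q + self.self_attn(q, q, q)[0])
        q = self.norm2(q + self.cross_attn(q, enc_out, enc_out)[0])
        q = self.norm3(q + self.ffn(q))
        return self.proj(q)  # (batch, num_queries, llm_dim)


def segment_level_connect(connector, enc_out, seg_len=1500):
    """Sketch of the segment-level idea: run the connector on fixed-length
    chunks of a long encoder output and concatenate the resulting queries,
    so the number of LLM input embeddings grows with the audio duration."""
    segments = enc_out.split(seg_len, dim=1)
    return torch.cat([connector(s) for s in segments], dim=1)
```

In this sketch, the concatenated query outputs would be placed in front of the embedded text prompt before being fed to the LLM; a fully connected or cross-attention connector would replace `QFormerConnector` while keeping the rest of the pipeline unchanged.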
Authors: Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang