An Embarrassingly Simple Approach for LLM with Strong ASR Capacity (2402.08846v1)
Abstract: In this paper, we focus on solving one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLMs). Recent works feature complex designs such as temporally compressing the output of the speech encoder, tackling modal alignment in the projector, and applying parameter-efficient fine-tuning to the LLM. We find that such delicate designs are not necessary: an embarrassingly simple composition of an off-the-shelf speech encoder, an LLM, and a single trainable linear projector is competent for the ASR task. To be more specific, we benchmark and explore various combinations of LLMs and speech encoders, leading to an optimal LLM-based ASR system, which we call SLAM-ASR. SLAM-ASR provides a clean setup with little task-specific design, where only the linear projector is trained. To the best of our knowledge, SLAM-ASR achieves the best performance on the LibriSpeech benchmark among LLM-based ASR models, and even outperforms the latest LLM-based audio-universal model trained on massive paired data. Finally, we explore the capability emergence of LLM-based ASR during modal alignment. We hope that our study can facilitate research on extending LLMs with cross-modal capacity and benefit the LLM-based ASR community.
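To make the composition concrete, here is a minimal PyTorch sketch of the frozen-encoder / trainable-linear-projector / frozen-LLM setup the abstract describes. The checkpoint names (`facebook/wav2vec2-base-960h`, `TinyLlama/TinyLlama-1.1B-Chat-v1.0`) and the class `LinearProjectorASR` are illustrative assumptions, not the paper's exact configuration (the paper benchmarks several encoder/LLM pairs).

```python
# Minimal sketch of the setup described in the abstract: a frozen speech
# encoder and a frozen LLM connected by the only trainable module, a
# linear projector. Checkpoint choices below are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, Wav2Vec2Model

class LinearProjectorASR(nn.Module):
    def __init__(self,
                 encoder_name="facebook/wav2vec2-base-960h",
                 llm_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(encoder_name)
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        # Freeze both foundation models; only the projector is trained.
        for p in self.encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False
        # Map encoder features into the LLM's embedding space.
        self.projector = nn.Linear(self.encoder.config.hidden_size,
                                   self.llm.config.hidden_size)

    def forward(self, waveform, prompt_embeds):
        # waveform: (batch, samples); prompt_embeds: (batch, T, d_llm)
        feats = self.encoder(waveform).last_hidden_state
        speech_embeds = self.projector(feats)
        # Prepend projected speech embeddings to the text prompt embeddings
        # and let the frozen LLM decode the transcript autoregressively.
        inputs = torch.cat([speech_embeds, prompt_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```

In training, the optimizer would receive only `model.projector.parameters()`, which is what keeps this setup so lightweight compared to parameter-efficient fine-tuning of the LLM itself.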
Authors: Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen