SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation (2310.09424v1)
Abstract: We present a novel Speech-Augmented Language Model (SALM) with {\em multitask} and {\em in-context} learning capabilities. SALM comprises a frozen text LLM, an audio encoder, a modality adapter module, and LoRA layers to accommodate speech input and associated task instructions. The unified SALM not only achieves performance on par with task-specific Conformer baselines for Automatic Speech Recognition (ASR) and Speech Translation (AST), but also exhibits zero-shot in-context learning capabilities, demonstrated through a keyword-boosting task for ASR and AST. Moreover, {\em speech-supervised in-context training} is proposed to bridge the gap between LLM training and downstream speech tasks, which further boosts the in-context learning ability of speech-to-text models. The proposed model is open-sourced via the NeMo toolkit.
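The composition the abstract describes can be illustrated with a toy PyTorch sketch: an audio encoder produces frame features, a modality adapter projects them into the LLM embedding space, and the projected speech embeddings are concatenated with instruction-token embeddings before being fed to a frozen "LLM" carrying trainable LoRA updates. Every module choice, name, and dimension below is an illustrative assumption (e.g. a GRU stands in for the Conformer encoder, a one-layer Transformer for the text LLM), not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (B A) x."""
    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        for p in self.base.parameters():       # base weights stay frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, dim) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(dim, rank))         # up-projection, zero-init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.A.t() @ self.B.t()

class SALMSketch(nn.Module):
    """Toy SALM-style pipeline: audio encoder -> modality adapter -> frozen
    'LLM' over [speech embeddings ; instruction-token embeddings].
    All modules and sizes here are hypothetical stand-ins."""
    def __init__(self, audio_dim: int = 80, llm_dim: int = 64, vocab: int = 100):
        super().__init__()
        self.audio_encoder = nn.GRU(audio_dim, audio_dim, batch_first=True)  # stand-in for Conformer
        self.adapter = nn.Linear(audio_dim, llm_dim)   # modality adapter module
        self.embed = nn.Embedding(vocab, llm_dim)      # LLM token embeddings
        self.llm = nn.TransformerEncoder(              # stand-in for the frozen text LLM
            nn.TransformerEncoderLayer(llm_dim, nhead=4, batch_first=True),
            num_layers=1)
        for p in self.llm.parameters():                # LLM weights stay frozen
            p.requires_grad = False
        # In the real model the LoRA layers live inside the LLM's attention
        # blocks; here a single LoRALinear after the LLM keeps the sketch short.
        self.lora = LoRALinear(llm_dim)
        self.head = nn.Linear(llm_dim, vocab)

    def forward(self, audio: torch.Tensor, prompt_ids: torch.Tensor) -> torch.Tensor:
        enc, _ = self.audio_encoder(audio)             # (B, T_audio, audio_dim)
        speech_emb = self.adapter(enc)                 # (B, T_audio, llm_dim)
        text_emb = self.embed(prompt_ids)              # (B, T_text, llm_dim)
        x = torch.cat([speech_emb, text_emb], dim=1)   # speech precedes the instruction
        return self.head(self.lora(self.llm(x)))       # (B, T_audio + T_text, vocab)

model = SALMSketch()
logits = model(torch.randn(2, 10, 80), torch.randint(0, 100, (2, 5)))
print(logits.shape)  # torch.Size([2, 15, 100])
```

Because the LLM parameters are frozen, only the audio encoder, adapter, LoRA, and output head receive gradients, which is the key property that lets a pretrained text LLM accommodate speech input cheaply.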
- Zhehuai Chen
- He Huang
- Andrei Andrusenko
- Oleksii Hrinchuk
- Krishna C. Puvvada
- Jason Li
- Subhankar Ghosh
- Jagadeesh Balam
- Boris Ginsburg