SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation (2310.09424v1)

Published 13 Oct 2023 in cs.CL, cs.HC, cs.SD, and eess.AS

Abstract: We present a novel Speech-Augmented Language Model (SALM) with multitask and in-context learning capabilities. SALM comprises a frozen text LLM, an audio encoder, a modality adapter module, and LoRA layers to accommodate speech input and associated task instructions. The unified SALM not only achieves performance on par with task-specific Conformer baselines for Automatic Speech Recognition (ASR) and Speech Translation (AST), but also exhibits zero-shot in-context learning capabilities, demonstrated through a keyword-boosting task for ASR and AST. Moreover, speech-supervised in-context training is proposed to bridge the gap between LLM training and downstream speech tasks, which further boosts the in-context learning ability of speech-to-text models. The proposed model is open-sourced via the NeMo toolkit.
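The architecture described above can be sketched in a few lines of NumPy. This is a hedged, minimal illustration of the data flow only (all shapes, layer names, and the toy linear maps are assumptions for illustration, not the NeMo implementation): speech features pass through an audio encoder, a modality adapter projects them into the frozen LLM's embedding space, and the result is concatenated with embedded task-instruction tokens to form the LLM input.

```python
import numpy as np

rng = np.random.default_rng(0)

D_AUDIO, D_LLM = 80, 512  # hypothetical audio-encoder output dim and LLM hidden size

def audio_encoder(speech_feats):
    # stand-in for the Conformer audio encoder: a fixed linear map over features
    W = np.ones((speech_feats.shape[-1], D_AUDIO)) / speech_feats.shape[-1]
    return speech_feats @ W

def modality_adapter(audio_out):
    # projects audio-encoder outputs into the frozen LLM's embedding space
    W = np.ones((D_AUDIO, D_LLM)) / D_AUDIO
    return audio_out @ W

def embed_text(token_ids, vocab_size=1000):
    # frozen LLM token-embedding lookup (random table for illustration)
    table = rng.standard_normal((vocab_size, D_LLM))
    return table[token_ids]

speech = rng.standard_normal((120, 64))  # (frames, raw acoustic feature dim)
prompt = np.array([1, 5, 7])             # tokenized task instruction

speech_emb = modality_adapter(audio_encoder(speech))  # (120, 512)
prompt_emb = embed_text(prompt)                       # (3, 512)

# unified input sequence for the frozen LLM (LoRA layers not shown)
llm_input = np.concatenate([prompt_emb, speech_emb], axis=0)
print(llm_input.shape)  # (123, 512)
```

The key point the sketch captures is that only the adapter (and LoRA layers, omitted here) would be trained; the text LLM consumes adapted speech embeddings exactly as it would ordinary token embeddings.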

Authors (9)
  1. Zhehuai Chen (39 papers)
  2. He Huang (97 papers)
  3. Andrei Andrusenko (12 papers)
  4. Oleksii Hrinchuk (20 papers)
  5. Krishna C. Puvvada (28 papers)
  6. Jason Li (91 papers)
  7. Subhankar Ghosh (41 papers)
  8. Jagadeesh Balam (39 papers)
  9. Boris Ginsburg (111 papers)
Citations (35)