SALMONN: Towards Generic Hearing Abilities for Large Language Models (2310.13289v2)

Published 20 Oct 2023 in cs.SD, cs.CL, and eess.AS

Abstract: Hearing is arguably an essential ability of AI agents in the physical world, which refers to the perception and understanding of general auditory information consisting of at least three types of sounds: speech, audio events, and music. In this paper, we propose SALMONN, a speech audio language music open neural network, built by integrating a pre-trained text-based LLM with speech and audio encoders into a single multimodal model. SALMONN enables the LLM to directly process and understand general audio inputs and achieve competitive performances on a number of speech and audio tasks used in training, such as automatic speech recognition and translation, auditory-information-based question answering, emotion recognition, speaker verification, and music and audio captioning etc. SALMONN also has a diverse set of emergent abilities unseen in the training, which includes but is not limited to speech translation to untrained languages, speech-based slot filling, spoken-query-based question answering, audio-based storytelling, and speech audio co-reasoning etc. The presence of cross-modal emergent abilities is studied, and a novel few-shot activation tuning approach is proposed to activate such abilities. To our knowledge, SALMONN is the first model of its type and can be regarded as a step towards AI with generic hearing abilities. The source code, model checkpoints and data are available at https://github.com/bytedance/SALMONN.

An Overview of SALMONN: Towards Generic Hearing Abilities for LLMs

The paper introduces SALMONN, a Speech Audio Language Music Open Neural Network, a multimodal model that equips LLMs with the ability to perceive and understand general auditory information. LLMs have long excelled at text-based NLP tasks; SALMONN extends this success by integrating a pre-trained text-based LLM with speech and audio encoders, allowing it to interpret and respond to speech, audio events, and music directly.

Methodology

SALMONN employs a dual-encoder structure to handle different auditory inputs: OpenAI's Whisper model encodes speech, while the BEATs audio encoder handles non-speech audio. The frame-level outputs of the two encoders are synchronized, concatenated, and fused by a window-level Query Transformer (Q-Former), which emits a fixed number of query tokens per window; the resulting audio token sequence therefore scales with the input length and is fed, together with the text prompt, into the Vicuna LLM. A sketch of this window-level fusion is given below.
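
The following is a minimal PyTorch sketch of such window-level fusion. The encoder dimensions, window length, number of queries, and the use of a standard transformer decoder in place of the Q-Former's cross-attention blocks are illustrative assumptions, not SALMONN's published configuration; the sketch only assumes the two encoders produce frame-synchronous features.

```python
# Sketch of window-level fusion: frame-synchronous speech and audio features are
# concatenated, split into fixed-length windows, and a small set of trainable
# queries cross-attends within each window. Hyper-parameters are illustrative.
import torch
import torch.nn as nn

class WindowLevelQFormer(nn.Module):
    def __init__(self, enc_dim, llm_dim, n_queries=1, window=17, n_layers=2, n_heads=8):
        super().__init__()
        # enc_dim must equal the concatenated feature dimension of the two encoders.
        self.window = window
        self.queries = nn.Parameter(torch.randn(n_queries, enc_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=enc_dim, nhead=n_heads, batch_first=True)
        # A transformer decoder stands in for the Q-Former: queries self-attend and
        # cross-attend to the encoder frames of one window.
        self.qformer = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(enc_dim, llm_dim)  # map query outputs into the LLM embedding space

    def forward(self, speech_feats, audio_feats):
        # speech_feats, audio_feats: (batch, frames, dim); assumed frame-synchronous.
        x = torch.cat([speech_feats, audio_feats], dim=-1)   # (B, T, enc_dim)
        B, T, D = x.shape
        pad = (-T) % self.window
        x = nn.functional.pad(x, (0, 0, 0, pad))             # pad T to a window multiple
        n_win = (T + pad) // self.window
        x = x.reshape(B * n_win, self.window, D)              # one segment per window
        q = self.queries.unsqueeze(0).expand(B * n_win, -1, -1)
        out = self.qformer(q, x)                               # (B * n_win, n_queries, D)
        out = out.reshape(B, n_win * q.size(1), D)
        return self.proj(out)                                  # audio tokens for the LLM

# Example with Whisper-large-sized (1280-d) and BEATs-sized (768-d) features:
qf = WindowLevelQFormer(enc_dim=1280 + 768, llm_dim=4096)
tokens = qf(torch.randn(1, 100, 1280), torch.randn(1, 100, 768))  # -> (1, 6, 4096)
```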

The training methodology is divided into three crucial stages:

  1. Pre-training Stage: SALMONN's Q-Former and LoRA components are pre-trained on a large corpus of speech recognition and audio captioning data to achieve high-quality audio-text alignment.
  2. Instruction Tuning Stage: This stage involves fine-tuning SALMONN on various tasks such as speech recognition, translation, audio captioning, and others. These tasks are treated as instruction-response pairs, enhancing the model's ability to follow complex user instructions.
  3. Activation Tuning Stage: The novel aspect of SALMONN's training is the activation tuning stage, which addresses the task over-fitting observed after instruction tuning. A small number of training examples with longer, more diverse responses is used, in a few-shot manner, to activate emergent abilities on tasks that were not explicitly trained, and a reduction of the LoRA scaling factor is leveraged to surface these latent abilities without significantly degrading performance on the trained tasks (a minimal LoRA sketch follows this list).
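
To make the scaling-factor mechanism concrete, the sketch below implements a LoRA-adapted linear layer whose adapter contribution is controlled by a single scaling factor; lowering that factor after training discounts the instruction-tuned update. The rank, alpha, and layer sizes here are illustrative assumptions, not SALMONN's settings.

```python
# Minimal LoRA-adapted linear layer with an adjustable scaling factor.
# Rank, alpha, and layer sizes below are illustrative, not SALMONN's settings.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base                          # frozen pre-trained projection
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank                 # the LoRA scaling factor

    def forward(self, x):
        # Effective weight is W + scale * (B @ A); at scale = 0 the layer
        # reverts to the frozen pre-trained behaviour.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Reducing the scaling factor after training discounts the instruction-tuned
# update, which is the knob the activation tuning discussion revolves around.
layer = LoRALinear(nn.Linear(4096, 4096))
layer.scale *= 0.5
```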

Empirical Evaluation

The paper evaluates SALMONN across three levels of auditory tasks:

  • Level 1: Tasks utilized in instruction tuning, including automatic speech recognition (ASR), audio captioning (AAC), and speech translation (AST). SALMONN demonstrates competitive performance in these areas, aligning closely with state-of-the-art models.
  • Level 2: Speech-based NLP tasks that SALMONN was not explicitly trained for, such as speech-based slot filling, keyword extraction, and translations to untrained languages. SALMONN achieves notable performance, indicating successful generalization and high-quality cross-modal alignment.
  • Level 3: New, challenging tasks like audio-based storytelling and speech audio co-reasoning, which require interpreting and reasoning over both speech and non-speech audio inputs. SALMONN shows promising results, particularly after activation tuning, following complex auditory instructions and producing coherent, contextually relevant outputs (an illustrative prompt is sketched after this list).
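
To make the distinction between the levels concrete, the snippet below shows how an untrained task can be posed purely through the text instruction while the audio tokens stay the same. The template and the placeholder for the audio tokens are assumptions in a Vicuna-style chat format, not the paper's exact prompt strings.

```python
# Illustrative only: trained and emergent tasks differ only in the text
# instruction paired with the same audio tokens. The template and the
# <Audio>...</Audio> placeholder are assumptions, not SALMONN's exact prompts.
AUDIO = "<Audio><AudioHere></Audio>"  # stands in for the Q-Former output tokens

def build_prompt(instruction: str) -> str:
    return f"USER: {AUDIO} {instruction}\nASSISTANT:"

trained = build_prompt("Recognize the speech and give the transcription.")  # Level 1 (ASR)
emergent = build_prompt("Listen to the audio and write a story about it.")  # Level 3 (storytelling)
```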

Implications and Future Directions

The development and performance of SALMONN suggest several practical and theoretical advancements in the field of AI:

  • Enhanced Multimodal Understanding: The ability of SALMONN to interpret and respond to various auditory inputs represents a significant step toward creating AI with more holistic sensory capabilities.
  • Task Generalization: SALMONN's success on untrained tasks demonstrates the potential for generalized AI systems that can adapt to new, unseen tasks with minimal additional training.
  • Cross-modal Integration: The model's design highlights the importance of effective cross-modal integration techniques, such as the Q-Former and LoRA, in bridging the gap between different modalities.

Looking forward, future research could explore optimizing the activation tuning process further, integrating additional sensory modalities such as vision or touch, and developing more sophisticated methods to handle real-time, continuous audio inputs. The results from SALMONN lay the groundwork for more versatile AI agents capable of interacting with the physical world in a more nuanced and comprehensive manner.

Overall, SALMONN introduces a robust framework for enhancing the auditory capabilities of LLMs, extending their application beyond text-based tasks and opening avenues for advanced multimodal AI research.

Authors (9)
  1. Changli Tang
  2. Wenyi Yu
  3. Guangzhi Sun
  4. Xianzhao Chen
  5. Tian Tan
  6. Wei Li
  7. Lu Lu
  8. Zejun Ma
  9. Chao Zhang