SALMONN: Towards Generic Hearing Abilities for Large Language Models (2310.13289v2)

Published 20 Oct 2023 in cs.SD, cs.CL, and eess.AS

Abstract: Hearing is arguably an essential ability of AI agents in the physical world, which refers to the perception and understanding of general auditory information consisting of at least three types of sounds: speech, audio events, and music. In this paper, we propose SALMONN, a speech audio language music open neural network, built by integrating a pre-trained text-based LLM with speech and audio encoders into a single multimodal model. SALMONN enables the LLM to directly process and understand general audio inputs and achieve competitive performances on a number of speech and audio tasks used in training, such as automatic speech recognition and translation, auditory-information-based question answering, emotion recognition, speaker verification, and music and audio captioning etc. SALMONN also has a diverse set of emergent abilities unseen in the training, which includes but is not limited to speech translation to untrained languages, speech-based slot filling, spoken-query-based question answering, audio-based storytelling, and speech audio co-reasoning etc. The presence of cross-modal emergent abilities is studied, and a novel few-shot activation tuning approach is proposed to activate such abilities. To our knowledge, SALMONN is the first model of its type and can be regarded as a step towards AI with generic hearing abilities. The source code, model checkpoints and data are available at https://github.com/bytedance/SALMONN.

An Overview of SALMONN: Towards Generic Hearing Abilities for LLMs

The paper introduces SALMONN, a Speech Audio Language Music Open Neural Network, a multimodal model that equips LLMs with the ability to perceive and understand general auditory information. LLMs have long excelled at text-based NLP tasks; SALMONN extends this success by integrating a pre-trained text-based LLM with speech and audio encoders, allowing it to interpret and respond to speech, audio events, and music directly.

Methodology

SALMONN employs a dual-encoder structure to handle different auditory inputs: OpenAI's Whisper model encodes speech, while the BEATs audio encoder handles non-speech audio. The frame-level outputs of the two encoders are synchronized, concatenated, and fused by a window-level Query Transformer (Q-Former), which emits a fixed number of query tokens per window; the resulting audio token sequence therefore scales with the input length and is fed, together with the text prompt, into the Vicuna LLM. A sketch of this window-level fusion is given below.
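
The following is a minimal PyTorch sketch of such window-level fusion. The encoder dimensions, window length, number of queries, and the use of a standard transformer decoder in place of the Q-Former's cross-attention blocks are illustrative assumptions, not SALMONN's published configuration; the sketch only assumes the two encoders produce frame-synchronous features.

```python
# Sketch of window-level fusion: frame-synchronous speech and audio features are
# concatenated, split into fixed-length windows, and a small set of trainable
# queries cross-attends within each window. Hyper-parameters are illustrative.
import torch
import torch.nn as nn

class WindowLevelQFormer(nn.Module):
    def __init__(self, enc_dim, llm_dim, n_queries=1, window=17, n_layers=2, n_heads=8):
        super().__init__()
        # enc_dim must equal the concatenated feature dimension of the two encoders.
        self.window = window
        self.queries = nn.Parameter(torch.randn(n_queries, enc_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=enc_dim, nhead=n_heads, batch_first=True)
        # A transformer decoder stands in for the Q-Former: queries self-attend and
        # cross-attend to the encoder frames of one window.
        self.qformer = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(enc_dim, llm_dim)  # map query outputs into the LLM embedding space

    def forward(self, speech_feats, audio_feats):
        # speech_feats, audio_feats: (batch, frames, dim); assumed frame-synchronous.
        x = torch.cat([speech_feats, audio_feats], dim=-1)   # (B, T, enc_dim)
        B, T, D = x.shape
        pad = (-T) % self.window
        x = nn.functional.pad(x, (0, 0, 0, pad))             # pad T to a window multiple
        n_win = (T + pad) // self.window
        x = x.reshape(B * n_win, self.window, D)              # one segment per window
        q = self.queries.unsqueeze(0).expand(B * n_win, -1, -1)
        out = self.qformer(q, x)                               # (B * n_win, n_queries, D)
        out = out.reshape(B, n_win * q.size(1), D)
        return self.proj(out)                                  # audio tokens for the LLM

# Example with Whisper-large-sized (1280-d) and BEATs-sized (768-d) features:
qf = WindowLevelQFormer(enc_dim=1280 + 768, llm_dim=4096)
tokens = qf(torch.randn(1, 100, 1280), torch.randn(1, 100, 768))  # -> (1, 6, 4096)
```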

The training methodology is divided into three crucial stages:

  1. Pre-training Stage: SALMONN's Q-Former and LoRA components are pre-trained on a large corpus of speech recognition and audio captioning data to achieve high-quality audio-text alignment.
  2. Instruction Tuning Stage: This stage involves fine-tuning SALMONN on various tasks such as speech recognition, translation, audio captioning, and others. These tasks are treated as instruction-response pairs, enhancing the model's ability to follow complex user instructions.
  3. Activation Tuning Stage: The novel aspect of SALMONN's training is the activation tuning stage, which addresses the task over-fitting observed after instruction tuning. A small number of training examples with longer, more diverse responses is used, in a few-shot manner, to activate emergent abilities on tasks that were not explicitly trained, and a reduction of the LoRA scaling factor is leveraged to surface these latent abilities without significantly degrading performance on the trained tasks (a minimal LoRA sketch follows this list).
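
To make the scaling-factor mechanism concrete, the sketch below implements a LoRA-adapted linear layer whose adapter contribution is controlled by a single scaling factor; lowering that factor after training discounts the instruction-tuned update. The rank, alpha, and layer sizes here are illustrative assumptions, not SALMONN's settings.

```python
# Minimal LoRA-adapted linear layer with an adjustable scaling factor.
# Rank, alpha, and layer sizes below are illustrative, not SALMONN's settings.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base                          # frozen pre-trained projection
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank                 # the LoRA scaling factor

    def forward(self, x):
        # Effective weight is W + scale * (B @ A); at scale = 0 the layer
        # reverts to the frozen pre-trained behaviour.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Reducing the scaling factor after training discounts the instruction-tuned
# update, which is the knob the activation tuning discussion revolves around.
layer = LoRALinear(nn.Linear(4096, 4096))
layer.scale *= 0.5
```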

Empirical Evaluation

The paper evaluates SALMONN across three levels of auditory tasks:

  • Level 1: Tasks utilized in instruction tuning, including automatic speech recognition (ASR), audio captioning (AAC), and speech translation (AST). SALMONN demonstrates competitive performance in these areas, aligning closely with state-of-the-art models.
  • Level 2: Speech-based NLP tasks that SALMONN was not explicitly trained for, such as speech-based slot filling, keyword extraction, and translations to untrained languages. SALMONN achieves notable performance, indicating successful generalization and high-quality cross-modal alignment.
  • Level 3: New, challenging tasks like audio-based storytelling and speech audio co-reasoning, which require interpreting and reasoning over both speech and non-speech audio inputs. SALMONN shows promising results, particularly after activation tuning, following complex auditory instructions and producing coherent, contextually relevant outputs (an illustrative prompt is sketched after this list).
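
To make the distinction between the levels concrete, the snippet below shows how an untrained task can be posed purely through the text instruction while the audio tokens stay the same. The template and the placeholder for the audio tokens are assumptions in a Vicuna-style chat format, not the paper's exact prompt strings.

```python
# Illustrative only: trained and emergent tasks differ only in the text
# instruction paired with the same audio tokens. The template and the
# <Audio>...</Audio> placeholder are assumptions, not SALMONN's exact prompts.
AUDIO = "<Audio><AudioHere></Audio>"  # stands in for the Q-Former output tokens

def build_prompt(instruction: str) -> str:
    return f"USER: {AUDIO} {instruction}\nASSISTANT:"

trained = build_prompt("Recognize the speech and give the transcription.")  # Level 1 (ASR)
emergent = build_prompt("Listen to the audio and write a story about it.")  # Level 3 (storytelling)
```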

Implications and Future Directions

The development and performance of SALMONN suggest several practical and theoretical advancements in the field of AI:

  • Enhanced Multimodal Understanding: The ability of SALMONN to interpret and respond to various auditory inputs represents a significant step toward creating AI with more holistic sensory capabilities.
  • Task Generalization: SALMONN's success on untrained tasks demonstrates the potential for generalized AI systems that can adapt to new, unseen tasks with minimal additional training.
  • Cross-modal Integration: The model's design highlights the importance of effective cross-modal integration techniques, such as the Q-Former and LoRA, in bridging the gap between different modalities.

Looking forward, future research could explore optimizing the activation tuning process further, integrating additional sensory modalities such as vision or touch, and developing more sophisticated methods to handle real-time, continuous audio inputs. The results from SALMONN lay the groundwork for more versatile AI agents capable of interacting with the physical world in a more nuanced and comprehensive manner.

Overall, SALMONN introduces a robust framework for enhancing the auditory capabilities of LLMs, extending their application beyond text-based tasks and opening avenues for advanced multimodal AI research.

Authors (9)
  1. Changli Tang
  2. Wenyi Yu
  3. Guangzhi Sun
  4. Xianzhao Chen
  5. Tian Tan
  6. Wei Li
  7. Lu Lu
  8. Zejun Ma
  9. Chao Zhang