
Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities (2402.01831v3)

Published 2 Feb 2024 in cs.SD, cs.LG, and eess.AS

Abstract: Augmenting LLMs to understand audio -- including non-speech sounds and non-verbal speech -- is critically important for diverse real-world applications of LLMs. In this paper, we propose Audio Flamingo, a novel audio LLM with 1) strong audio understanding abilities, 2) the ability to quickly adapt to unseen tasks via in-context learning and retrieval, and 3) strong multi-turn dialogue abilities. We introduce a series of training techniques, architecture design, and data strategies to enhance our model with these abilities. Extensive evaluations across various audio understanding tasks confirm the efficacy of our method, setting new state-of-the-art benchmarks. Our demo website is https://audioflamingo.github.io/ and the code is open-sourced at https://github.com/NVIDIA/audio-flamingo.

Introduction

Extending LLMs to understand audio means going beyond speech and verbal content to a broader spectrum of sound, including non-verbal communication. Despite their impressive textual understanding, LLMs' auditory comprehension has typically been confined to transcribed speech, neglecting the rich information carried by non-speech sounds. Prior audio-augmented models have not offered a unified framework that combines strong audio understanding, multi-turn dialogue, and quick adaptation to novel tasks without fine-tuning. Addressing these gaps, Audio Flamingo advances the state of the art by incorporating in-context learning (ICL), retrieval-augmented generation (RAG), and robust multi-turn dialogue capabilities.

Model Architecture and Training

Audio Flamingo distinguishes itself with an architecture designed to process variable-length audio inputs efficiently, capturing temporal information that previous approaches lose. The audio feature extractor uses a sliding-window technique to preserve this information over longer audio inputs. To avoid the excessive complexity of prior models, a cross-attention mechanism borrowed from the Flamingo methodology conditions the language model on audio with complexity that is linear in the number of audio tokens.
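The summary above does not reproduce implementation details; the following is a minimal PyTorch sketch of how a sliding-window audio encoder feeding Flamingo-style gated cross-attention could look. The module names, window sizes, and dimensions are assumptions for illustration, not the released Audio Flamingo implementation.

```python
# Illustrative sketch only; names, dimensions, and window sizes are assumptions.
import torch
import torch.nn as nn


class SlidingWindowAudioEncoder(nn.Module):
    """Split a long mel-spectrogram into overlapping windows and encode each
    window, preserving temporal order for variable-length audio."""

    def __init__(self, n_mels: int = 64, d_model: int = 512,
                 window: int = 256, hop: int = 128):
        super().__init__()
        self.window, self.hop = window, hop
        self.proj = nn.Linear(n_mels * window, d_model)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, time, n_mels) -> (batch, n_windows, n_mels, window)
        windows = mel.unfold(1, self.window, self.hop)
        return self.proj(windows.flatten(2))          # (batch, n_windows, d_model)


class GatedCrossAttentionBlock(nn.Module):
    """Text states attend to audio tokens (Flamingo-style conditioning); for a
    fixed text length the cost grows linearly with the number of audio tokens."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))       # tanh gate, starts closed

    def forward(self, text_h: torch.Tensor, audio_h: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(query=text_h, key=audio_h, value=audio_h)
        return text_h + torch.tanh(self.gate) * attended
```

The gated residual connection lets the pretrained language model start from its text-only behavior and gradually learn to use the audio tokens during training.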

The model is trained on a carefully curated, heterogeneous dataset of approximately 5.9 million audio-text pairs. A two-stage approach of pre-training followed by supervised fine-tuning optimizes the model's understanding of a wide array of sounds. With less than a third of the parameter count of some existing methods, this framework achieves superior performance across diverse audio understanding benchmarks.
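As a rough illustration of a two-stage schedule, the hedged sketch below freezes the language-model backbone during pre-training and unfreezes it for supervised fine-tuning. Which modules are frozen in each stage, the step counts, and the learning rates are assumptions for illustration, not a statement of the authors' exact recipe.

```python
# Hedged sketch of a two-stage schedule: pre-training, then supervised fine-tuning.
import torch


def set_trainable(model: torch.nn.Module, train_lm: bool) -> None:
    """Freeze or unfreeze the LM backbone; audio and cross-attention modules
    stay trainable in both stages (an assumed split, for illustration)."""
    for name, p in model.named_parameters():
        p.requires_grad = train_lm if "language_model" in name else True


def run_stage(model, loader, steps: int, lr: float) -> None:
    """Run one stage with a standard next-token (causal LM) loss."""
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(params, lr=lr)
    for _, batch in zip(range(steps), loader):
        loss = model(**batch).loss      # assumes a HuggingFace-style forward
        loss.backward()
        opt.step()
        opt.zero_grad()


# Usage (model and data loaders assumed to exist):
#   set_trainable(model, train_lm=False); run_stage(model, pretrain_loader, 200_000, 1e-4)
#   set_trainable(model, train_lm=True);  run_stage(model, sft_loader, 50_000, 2e-5)
```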

Few-Shot Learning and Dialogue

The authors introduce several techniques to give Audio Flamingo an effective few-shot learning mechanism through ICL-based RAG. The model adapts quickly to new tasks without task-specific fine-tuning, setting new benchmarks for few-shot performance.
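To make the ICL-based retrieval idea concrete, here is a hedged sketch: embed the query audio, find its nearest neighbours in a datastore of audio-text pairs, and prepend them as in-context examples. The embedding model, datastore layout, and prompt template are assumptions for illustration, not the authors' exact pipeline.

```python
# Hedged sketch of ICL-based retrieval augmentation.
import numpy as np


def build_icl_prompt(task_instruction: str,
                     query_emb: np.ndarray,        # (d,) embedding of the query clip
                     store_embs: np.ndarray,       # (N, d) embeddings of stored clips
                     store_texts: list[str],       # N reference captions / answers
                     k: int = 3) -> str:
    # Cosine similarity between the query and every stored clip.
    q = query_emb / np.linalg.norm(query_emb)
    s = store_embs / np.linalg.norm(store_embs, axis=1, keepdims=True)
    top_k = np.argsort(-(s @ q))[:k]

    # Most similar example goes last, i.e. closest to the query in the prompt.
    shots = [f"Audio: <clip {i}>\nAnswer: {store_texts[i]}" for i in top_k[::-1]]
    return "\n\n".join(shots + [f"Audio: <query clip>\n{task_instruction}"])
```

In practice an approximate nearest-neighbour index would replace the brute-force similarity computation once the datastore grows large.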

Moreover, the paper demonstrates robust multi-turn dialogue abilities, an area largely unexplored by prior audio language models. Using two multi-turn dialogue datasets generated with GPT-4, the authors show that the model sustains contextually coherent conversations and significantly outperforms existing methods. These findings are supported by an extensive evaluation covering both close-ended and open-ended tasks.
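One way to picture such dialogue data is the hedged sketch below, which flattens a multi-turn conversation about a single audio clip into one training sequence. The role tags and the rule of applying loss only to assistant turns are common conventions assumed here for illustration, not a description of the authors' exact format.

```python
# Hedged sketch of flattening a multi-turn dialogue about one audio clip.
def flatten_dialogue(turns: list[dict]) -> tuple[str, list[tuple[int, int]]]:
    """turns: [{"role": "user" | "assistant", "text": ...}, ...]
    Returns the flattened text and the character spans that receive loss."""
    text, loss_spans = "<audio>\n", []
    for turn in turns:
        piece = f"{turn['role']}: {turn['text']}\n"
        if turn["role"] == "assistant":
            loss_spans.append((len(text), len(text) + len(piece)))
        text += piece
    return text, loss_spans


# Example:
# flatten_dialogue([
#     {"role": "user", "text": "What instrument is playing?"},
#     {"role": "assistant", "text": "A solo acoustic guitar."},
#     {"role": "user", "text": "Is the tempo fast or slow?"},
#     {"role": "assistant", "text": "Slow and relaxed."},
# ])
```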

Conclusion and Future Work

Audio Flamingo sets a new standard for audio understanding models by integrating strong audio perception, in-context adaptability, and dialogue ability within a single framework. Extensive benchmarks support these claims and motivate future work on scaling the underlying LLM, handling more complex speech tasks, and broader multimodal applications. The data strategy, the method of conditioning the model on audio, and the careful dataset creation all contribute to the model's proficiency, yielding state-of-the-art results in several settings. These results point to the potential of Audio Flamingo for applying LLMs to understanding and interacting within audio-rich environments.

Authors (6)
  1. Zhifeng Kong (26 papers)
  2. Arushi Goel (18 papers)
  3. Rohan Badlani (13 papers)
  4. Wei Ping (51 papers)
  5. Rafael Valle (31 papers)
  6. Bryan Catanzaro (123 papers)
Citations (49)