
Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities (2402.01831v3)

Published 2 Feb 2024 in cs.SD, cs.LG, and eess.AS

Abstract: Augmenting LLMs to understand audio -- including non-speech sounds and non-verbal speech -- is critically important for diverse real-world applications of LLMs. In this paper, we propose Audio Flamingo, a novel audio LLM with 1) strong audio understanding abilities, 2) the ability to quickly adapt to unseen tasks via in-context learning and retrieval, and 3) strong multi-turn dialogue abilities. We introduce a series of training techniques, architecture design, and data strategies to enhance our model with these abilities. Extensive evaluations across various audio understanding tasks confirm the efficacy of our method, setting new state-of-the-art benchmarks. Our demo website is https://audioflamingo.github.io/ and the code is open-sourced at https://github.com/NVIDIA/audio-flamingo.

Introduction

Extending LLMs to understand audio means going beyond speech and verbal content to a broader spectrum of sound, including non-verbal communication. Despite their impressive textual understanding, LLMs' auditory comprehension has typically been confined to transcribed speech, neglecting the rich information carried by non-speech sounds. Prior audio-augmented models have not offered a unified framework that combines strong audio understanding, multi-turn dialogue, and quick adaptation to novel tasks without fine-tuning. Addressing these gaps, Audio Flamingo advances the state of the art by incorporating in-context learning (ICL), retrieval-augmented generation (RAG), and robust multi-turn dialogue capabilities.

Model Architecture and Training

Audio Flamingo distinguishes itself with an architecture designed to process variable-length audio inputs efficiently, capturing temporal information that previous approaches lose. The audio feature extractor uses a sliding-window technique to preserve this information over longer audio inputs. To avoid the excessive complexity of prior models, a cross-attention mechanism borrowed from the Flamingo methodology conditions the language model on audio with complexity that is linear in the number of audio tokens.
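The summary above does not reproduce implementation details; the following is a minimal PyTorch sketch of how a sliding-window audio encoder feeding Flamingo-style gated cross-attention could look. The module names, window sizes, and dimensions are assumptions for illustration, not the released Audio Flamingo implementation.

```python
# Illustrative sketch only; names, dimensions, and window sizes are assumptions.
import torch
import torch.nn as nn


class SlidingWindowAudioEncoder(nn.Module):
    """Split a long mel-spectrogram into overlapping windows and encode each
    window, preserving temporal order for variable-length audio."""

    def __init__(self, n_mels: int = 64, d_model: int = 512,
                 window: int = 256, hop: int = 128):
        super().__init__()
        self.window, self.hop = window, hop
        self.proj = nn.Linear(n_mels * window, d_model)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, time, n_mels) -> (batch, n_windows, n_mels, window)
        windows = mel.unfold(1, self.window, self.hop)
        return self.proj(windows.flatten(2))          # (batch, n_windows, d_model)


class GatedCrossAttentionBlock(nn.Module):
    """Text states attend to audio tokens (Flamingo-style conditioning); for a
    fixed text length the cost grows linearly with the number of audio tokens."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))       # tanh gate, starts closed

    def forward(self, text_h: torch.Tensor, audio_h: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(query=text_h, key=audio_h, value=audio_h)
        return text_h + torch.tanh(self.gate) * attended
```

The gated residual connection lets the pretrained language model start from its text-only behavior and gradually learn to use the audio tokens during training.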

The model is trained on a carefully curated, heterogeneous dataset of approximately 5.9 million audio-text pairs. A two-stage approach of pre-training followed by supervised fine-tuning optimizes the model's understanding of a wide array of sounds. With less than a third of the parameter count of some existing methods, this framework achieves superior performance across diverse audio understanding benchmarks.
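As a rough illustration of a two-stage schedule, the hedged sketch below freezes the language-model backbone during pre-training and unfreezes it for supervised fine-tuning. Which modules are frozen in each stage, the step counts, and the learning rates are assumptions for illustration, not a statement of the authors' exact recipe.

```python
# Hedged sketch of a two-stage schedule: pre-training, then supervised fine-tuning.
import torch


def set_trainable(model: torch.nn.Module, train_lm: bool) -> None:
    """Freeze or unfreeze the LM backbone; audio and cross-attention modules
    stay trainable in both stages (an assumed split, for illustration)."""
    for name, p in model.named_parameters():
        p.requires_grad = train_lm if "language_model" in name else True


def run_stage(model, loader, steps: int, lr: float) -> None:
    """Run one stage with a standard next-token (causal LM) loss."""
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(params, lr=lr)
    for _, batch in zip(range(steps), loader):
        loss = model(**batch).loss      # assumes a HuggingFace-style forward
        loss.backward()
        opt.step()
        opt.zero_grad()


# Usage (model and data loaders assumed to exist):
#   set_trainable(model, train_lm=False); run_stage(model, pretrain_loader, 200_000, 1e-4)
#   set_trainable(model, train_lm=True);  run_stage(model, sft_loader, 50_000, 2e-5)
```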

Few-Shot Learning and Dialogue

The authors introduce several techniques to give Audio Flamingo an effective few-shot learning mechanism through ICL-based RAG. The model adapts quickly to new tasks without task-specific fine-tuning, setting new benchmarks for few-shot performance.
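To make the ICL-based retrieval idea concrete, here is a hedged sketch: embed the query audio, find its nearest neighbours in a datastore of audio-text pairs, and prepend them as in-context examples. The embedding model, datastore layout, and prompt template are assumptions for illustration, not the authors' exact pipeline.

```python
# Hedged sketch of ICL-based retrieval augmentation.
import numpy as np


def build_icl_prompt(task_instruction: str,
                     query_emb: np.ndarray,        # (d,) embedding of the query clip
                     store_embs: np.ndarray,       # (N, d) embeddings of stored clips
                     store_texts: list[str],       # N reference captions / answers
                     k: int = 3) -> str:
    # Cosine similarity between the query and every stored clip.
    q = query_emb / np.linalg.norm(query_emb)
    s = store_embs / np.linalg.norm(store_embs, axis=1, keepdims=True)
    top_k = np.argsort(-(s @ q))[:k]

    # Most similar example goes last, i.e. closest to the query in the prompt.
    shots = [f"Audio: <clip {i}>\nAnswer: {store_texts[i]}" for i in top_k[::-1]]
    return "\n\n".join(shots + [f"Audio: <query clip>\n{task_instruction}"])
```

In practice an approximate nearest-neighbour index would replace the brute-force similarity computation once the datastore grows large.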

Moreover, the paper demonstrates robust multi-turn dialogue abilities, an area largely unexplored by prior audio language models. Using two multi-turn dialogue datasets generated with GPT-4, the authors show that the model sustains contextually coherent conversations and significantly outperforms existing methods. These findings are supported by an extensive evaluation covering both close-ended and open-ended tasks.
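One way to picture such dialogue data is the hedged sketch below, which flattens a multi-turn conversation about a single audio clip into one training sequence. The role tags and the rule of applying loss only to assistant turns are common conventions assumed here for illustration, not a description of the authors' exact format.

```python
# Hedged sketch of flattening a multi-turn dialogue about one audio clip.
def flatten_dialogue(turns: list[dict]) -> tuple[str, list[tuple[int, int]]]:
    """turns: [{"role": "user" | "assistant", "text": ...}, ...]
    Returns the flattened text and the character spans that receive loss."""
    text, loss_spans = "<audio>\n", []
    for turn in turns:
        piece = f"{turn['role']}: {turn['text']}\n"
        if turn["role"] == "assistant":
            loss_spans.append((len(text), len(text) + len(piece)))
        text += piece
    return text, loss_spans


# Example:
# flatten_dialogue([
#     {"role": "user", "text": "What instrument is playing?"},
#     {"role": "assistant", "text": "A solo acoustic guitar."},
#     {"role": "user", "text": "Is the tempo fast or slow?"},
#     {"role": "assistant", "text": "Slow and relaxed."},
# ])
```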

Conclusion and Future Work

Audio Flamingo sets a new standard for audio understanding models by integrating strong audio perception, in-context adaptability, and dialogue ability within a single framework. Extensive benchmarks support these claims and motivate future work on scaling the underlying LLM, handling more complex speech tasks, and broader multimodal applications. The data strategy, the method of conditioning the model on audio, and the careful dataset creation all contribute to the model's proficiency, yielding state-of-the-art results in several settings. These results point to the potential of Audio Flamingo for applying LLMs to understanding and interacting within audio-rich environments.

Authors (6)
  1. Zhifeng Kong (26 papers)
  2. Arushi Goel (18 papers)
  3. Rohan Badlani (13 papers)
  4. Wei Ping (51 papers)
  5. Rafael Valle (31 papers)
  6. Bryan Catanzaro (123 papers)
Citations (49)