Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
Abstract: Augmenting LLMs to understand audio -- including non-speech sounds and non-verbal speech -- is critically important for diverse real-world applications of LLMs. In this paper, we propose Audio Flamingo, a novel audio LLM with 1) strong audio understanding abilities, 2) the ability to quickly adapt to unseen tasks via in-context learning and retrieval, and 3) strong multi-turn dialogue abilities. We introduce a series of training techniques, architecture designs, and data strategies to enhance our model with these abilities. Extensive evaluations across various audio understanding tasks confirm the efficacy of our method, setting a new state of the art on these benchmarks. Our demo website is https://audioflamingo.github.io/ and the code is open-sourced at https://github.com/NVIDIA/audio-flamingo.
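The abstract's second ability — adapting to unseen tasks via in-context learning and retrieval — typically means retrieving the most similar labeled audio examples (by embedding similarity) and prepending them to the prompt. The sketch below is illustrative only and is not the paper's implementation: the function names, the toy three-dimensional "embeddings", and the `<audio>` prompt token are all assumptions standing in for real CLAP-style audio features and the model's actual prompt format.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve_icl_examples(query_emb, store, k=2):
    # Rank the (caption, embedding) store by similarity to the query
    # and keep the top-k captions as in-context examples.
    ranked = sorted(store, key=lambda item: cosine(query_emb, item[1]), reverse=True)
    return [caption for caption, _ in ranked[:k]]

# Toy store of labeled clips; embeddings are hypothetical placeholders
# for real audio-encoder features.
store = [
    ("dog barking in a yard", [0.9, 0.1, 0.0]),
    ("piano melody, slow tempo", [0.1, 0.9, 0.2]),
    ("thunder and heavy rain", [0.8, 0.2, 0.1]),
]

query = [0.85, 0.15, 0.05]  # embedding of the new, unlabeled clip
examples = retrieve_icl_examples(query, store, k=2)

# Interleave retrieved examples into a few-shot prompt ("<audio>" marks
# where each clip's features would be injected).
prompt = "".join(f"<audio> Caption: {c}\n" for c in examples) + "<audio> Caption:"
print(examples)  # → ['dog barking in a yard', 'thunder and heavy rain']
```

In practice the store would hold millions of entries and the linear scan would be replaced with an approximate nearest-neighbor index, but the retrieve-then-interleave structure is the same.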