MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark (2410.19168v1)
Abstract: The ability to comprehend audio, which includes speech, non-speech sounds, and music, is crucial for AI agents to interact effectively with the world. We present MMAU, a novel benchmark designed to evaluate multimodal audio understanding models on tasks requiring expert-level knowledge and complex reasoning. MMAU comprises 10k carefully curated audio clips paired with human-annotated natural language questions and answers spanning speech, environmental sounds, and music. It includes information extraction and reasoning questions, requiring models to demonstrate 27 distinct skills across unique and challenging tasks. Unlike existing benchmarks, MMAU emphasizes advanced perception and reasoning with domain-specific knowledge, challenging models to tackle tasks akin to those faced by experts. We assess 18 open-source and proprietary (Large) Audio-LLMs, demonstrating the significant challenges posed by MMAU. Notably, even the most advanced Gemini Pro v1.5 achieves only 52.97% accuracy, and the state-of-the-art open-source Qwen2-Audio achieves only 52.50%, highlighting considerable room for improvement. We believe MMAU will drive the audio and multimodal research community to develop more advanced audio understanding models capable of solving complex audio tasks.
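To make the multiple-choice evaluation protocol concrete, below is a minimal Python sketch of how one might score a model on an MMAU-style test set and report per-domain accuracy. The manifest schema (`audio_path`, `question`, `choices`, `answer`, `domain`) and the `stub_model` function are illustrative assumptions for this sketch, not the official MMAU data format or any released API.

```python
# Minimal sketch of an MMAU-style evaluation loop.
# NOTE: field names and file layout are hypothetical; consult the actual
# MMAU release for its real schema.
import json
import random
from collections import defaultdict

def stub_model(audio_path: str, question: str, choices: list[str]) -> str:
    """Placeholder for an Audio-LLM call; picks a random choice here."""
    return random.choice(choices)

def evaluate(manifest_path: str) -> None:
    # Each item: {"audio_path", "question", "choices", "answer", "domain"}
    with open(manifest_path) as f:
        items = json.load(f)

    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for item in items:
        prediction = stub_model(item["audio_path"], item["question"], item["choices"])
        domain = item.get("domain", "all")  # e.g., speech / sound / music
        total[domain] += 1
        if prediction == item["answer"]:
            correct[domain] += 1

    for domain in sorted(total):
        acc = 100.0 * correct[domain] / total[domain]
        print(f"{domain}: {acc:.2f}% ({correct[domain]}/{total[domain]})")

if __name__ == "__main__":
    evaluate("mmau_test.json")  # hypothetical manifest filename
```

Swapping `stub_model` for a real Audio-LLM call (and a letter-extraction step for free-form outputs) would yield accuracies of the kind quoted above, such as 52.97% for Gemini Pro v1.5.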