- The paper introduces the BLAB benchmark, designed to evaluate audio language models on diverse long-form audio tasks, with recordings averaging 51 minutes in length and drawn from over 833 hours of data.
- The evaluation reveals significant performance limitations in current audio LMs when handling extended audio, with models struggling particularly on localization (<5% F1) and temporal reasoning tasks.
- The findings underscore the critical need for advancements in audio LM architectures and open-source models to effectively process long-duration audio and enhance real-world applications.
An Expert Analysis of "BLAB: Brutally Long Audio Bench"
The paper "BLAB: Brutally Long Audio Bench" introduces an innovative benchmark designed to evaluate large audio LLMs (LMs) on long-form audio tasks. Unlike traditional benchmarks, which focus primarily on short audio segments, this research highlights the challenges and requirements for processing lengthy audio data, offering new insights into the true capabilities of audio LMs.
Overview
The authors have created the Brutally Long Audio Bench (BLAB), a comprehensive benchmark that evaluates audio LMs on diverse tasks such as localization, duration estimation, emotion recognition, and counting, each requiring nuanced understanding of audio clips averaging 51 minutes in length. BLAB comprises over 833 hours of audio sourced from Creative Commons-licensed materials, paired with human-annotated, text-based questions and answers. The authors present a rigorous evaluation of several audio LMs, including prominent models such as Gemini 2.0 Pro and GPT-4o, revealing limitations in their ability to handle long-duration audio tasks.
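To make the evaluation setup concrete, the sketch below shows one plausible way to represent a BLAB-style item and score a model across the benchmark. The `BlabItem` fields, the `evaluate` loop, and the model/scorer interfaces are illustrative assumptions, not the paper's actual data format or evaluation harness.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class BlabItem:
    """One hypothetical benchmark item: a long audio recording plus a text question."""
    audio_path: str   # path to a long (e.g. ~51-minute) audio file
    question: str     # human-annotated text question about the audio
    reference: str    # gold answer used for scoring
    task: str         # e.g. "localization", "duration", "counting", "emotion"

def evaluate(items: list[BlabItem],
             model: Callable[[str, str], str],
             scorer: Callable[[str, str], float]) -> float:
    """Run a model over every item and average a task-appropriate score."""
    scores = [scorer(model(it.audio_path, it.question), it.reference) for it in items]
    return sum(scores) / len(scores) if scores else 0.0
```

In practice, a harness like this would dispatch a different scorer per task (span F1 for localization, exact match for counting and duration), which is the distinction the reported results rely on.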
Key Findings
The paper identifies significant difficulties for current audio LMs, particularly when dealing with extended audio durations. The evaluation shows a marked decline in performance as audio duration increases. Notably, models struggle with tasks requiring localization, temporal reasoning, and counting, and often rely more on the text prompt than on the audio content itself. This finding is supported by quantitative results, such as F1 scores below 5% on localization tasks and exact-match accuracy below 23% on duration and counting tasks.
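The reported metrics can be grounded with a small scoring sketch. The snippet below computes a span-level localization F1 via greedy IoU matching and a simple normalized exact-match score; the IoU threshold and matching rule are assumptions chosen for illustration, not the paper's exact scoring protocol.

```python
def span_iou(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Temporal intersection-over-union of two (start, end) spans in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def localization_f1(pred: list[tuple[float, float]],
                    gold: list[tuple[float, float]],
                    iou_threshold: float = 0.5) -> float:
    """Greedily match each predicted span to an unused gold span, then compute F1."""
    matched, tp = set(), 0
    for p in pred:
        best, best_iou = None, 0.0
        for i, g in enumerate(gold):
            if i not in matched and span_iou(p, g) > best_iou:
                best, best_iou = i, span_iou(p, g)
        if best is not None and best_iou >= iou_threshold:
            matched.add(best)
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the reference answer, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())
```

Under this kind of scoring, an F1 below 5% means predicted event spans almost never align with the annotated ones, which is consistent with the paper's observation that models lose track of timing information over long inputs.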
Implications
The implications of this paper are significant, on both theoretical and practical fronts. The findings underscore the pressing need to advance audio LM architectures to better accommodate long-form audio inputs. This is particularly relevant for applications that require understanding evolving prosodic cues, retaining context, and discerning temporal relationships in lengthy recordings. The research also highlights the need for open-source multimodal LMs with transparent documentation of training data, architectures, and methodologies to enable deeper insight into model performance.
Future Directions
Looking ahead, the paper advocates for developing innovative approaches to improve audio LMs' performance on long-form audio tasks. The introduction of BLAB sets the stage for future research aimed at enhancing audio-text integration in LLMs, exploring new architectures, and developing models capable of robustly handling extended audio inputs. Expanding the context range of audio LMs could significantly enhance accessibility and utility across various user populations, facilitating more natural and intuitive human-computer interactions.
In summary, "BLAB: Brutally Long Audio Bench" marks a significant contribution to the domain of audio LLMs, emphasizing the complexities of processing long-form audio and setting a foundational framework for advancing multimodal language technologies.