AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs

Published 9 Sep 2025 in cs.SD, cs.AI, cs.LG, and eess.AS | (2509.08031v2)

Abstract: Large Audio LLMs (LALMs) are rapidly advancing, but evaluating them remains challenging due to inefficient toolkits that limit fair comparison and systematic assessment. Current frameworks suffer from three critical issues: slow processing that bottlenecks large-scale studies, inconsistent prompting that hurts reproducibility, and narrow task coverage that misses important audio reasoning capabilities. We introduce AU-Harness, an efficient and comprehensive evaluation framework for LALMs. Our system achieves a speedup of up to 127% over existing toolkits through optimized batch processing and parallel execution, enabling large-scale evaluations previously impractical. We provide standardized prompting protocols and flexible configurations for fair model comparison across diverse scenarios. Additionally, we introduce two new evaluation categories: LLM-Adaptive Diarization for temporal audio understanding and Spoken Language Reasoning for complex audio-based cognitive tasks. Through evaluation across 380+ tasks, we reveal significant gaps in current LALMs, particularly in temporal understanding and complex spoken language reasoning tasks. Our findings also highlight a lack of standardization in instruction modality existent across audio benchmarks, which can lead up performance differences up to 9.5 absolute points on the challenging complex instruction following downstream tasks. AU-Harness provides both practical evaluation tools and insights into model limitations, advancing systematic LALM development.

Abstract PDF Upgrade to Chat

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

off on

Knowledge Gaps

off on

Practical Applications

off on

Glossary

off on

Conceptual Simplification

off on

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Generate Now

Continue Learning

We haven't generated follow-up questions for this paper yet.

Generate Now

Authors (8)

Collections

Tweets

YouTube

Show All Videos

alphaXiv

AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs (10 likes, 0 questions)

AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs

Summary

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Open Problems

Continue Learning

Authors (8)

Collections

Tweets

YouTube

alphaXiv

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research

AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs

Summary

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Open Problems

Continue Learning

Related Papers

Authors (8)

Collections

Tweets

YouTube

alphaXiv

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research