UltraEval-Audio: Unified Audio Evaluation
- UltraEval-Audio is a unified, modular framework designed to evaluate audio foundation models across understanding, generation, and codec tasks.
- It standardizes data preparation, model deployment, and evaluation, with YAML-driven configuration and one-command reproducibility that yields detailed benchmark metrics.
- The framework supports multiple languages—including non-English benchmarks—and facilitates fair, cross-model comparisons via a public leaderboard.
UltraEval-Audio is a unified, modular framework for the comprehensive evaluation of audio foundation models (AFMs), spanning audio understanding, generation, and codec tasks across multiple languages and domains. Its architecture, benchmark suite, evaluation methodology, and extensibility are specifically tailored to address long-standing challenges in fair cross-model comparison, holistic codec assessment, and the inclusion of non-English—particularly Chinese—benchmarks. UltraEval-Audio is public and open-source, with a real-time leaderboard and one-command reproducibility for both academic and industrial use (Shi et al., 4 Jan 2026).
1. System Architecture and Design Principles
UltraEval-Audio is structured around three integrated modules: Data Preparation, Model Deployment, and Evaluation (Shi et al., 4 Jan 2026). Data Preparation encompasses a standardized loader for audio fields, automatic format normalization (sampling rate, channel layout), and a flexible, task-specific prompt management system utilizing YAML and Jinja2 templates. For Model Deployment, each AFM is isolated within its own runtime (virtual environment or container), with a uniform Python .inference() API—abstracting remote, local, or proprietary deployments. The Evaluation module post-processes model outputs, applies rule-based (WER, BLEU, accuracy) and model-based (UTMOS, speaker embedding similarity, GPT-scored QA) evaluators, and aggregates results for leaderboard reporting.
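The paper specifies only that every AFM is wrapped behind a uniform Python .inference() call; the class and field names in the sketch below are illustrative assumptions about what such an adapter layer could look like, not the framework's actual API.

```python
# Minimal sketch of a uniform adapter layer; the class and field names are
# illustrative assumptions, not UltraEval-Audio's actual API.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class AudioSample:
    """Normalized input: mono waveform at a fixed sampling rate plus a rendered prompt."""
    waveform: list
    sample_rate: int
    prompt: str


@dataclass
class ModelOutput:
    """Harmonized output: text and/or audio, so every evaluator sees the same shape."""
    text: Optional[str] = None
    audio: Optional[list] = None


class BaseModelAdapter(ABC):
    """Each AFM (remote, local, or containerized) implements this single entry point."""

    @abstractmethod
    def inference(self, sample: AudioSample) -> ModelOutput:
        ...


class EchoAdapter(BaseModelAdapter):
    """Trivial stand-in adapter, used only to show the calling convention."""

    def inference(self, sample: AudioSample) -> ModelOutput:
        return ModelOutput(text=f"echo: {sample.prompt}")


if __name__ == "__main__":
    sample = AudioSample(waveform=[0.0] * 16000, sample_rate=16000,
                         prompt="Transcribe the audio.")
    print(EchoAdapter().inference(sample).text)
```

The design point this sketch captures is that evaluators only ever consume the harmonized text/audio output pair, regardless of where and how the underlying model actually runs.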
Configuration-driven operation allows researchers to define the complete evaluation pipeline via YAML files, specifying tasks, prompt templates, models, and output formats. A single command-line call executes the full protocol end-to-end, with results written to timestamped JSON/CSV and pushed to a public leaderboard. Extensibility is achieved through modular Python base classes and the separation of data, prompts, models, and evaluators.
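As an illustration of the configuration-driven workflow, the snippet below parses a hypothetical task definition of the shape described above; the key names are assumptions rather than the framework's actual YAML schema.

```python
# Hypothetical YAML task configuration; the key names are assumptions,
# not UltraEval-Audio's documented schema.
import yaml  # pip install pyyaml

CONFIG = """
task: asr_librispeech_test_clean
dataset:
  name: librispeech
  split: test-clean
  audio_field: audio
prompt_template: prompts/asr_en.jinja2
model: qwen2.5-omni
evaluator:
  type: wer
  lowercase: true
output:
  formats: [json, csv]
  push_to_leaderboard: true
"""

if __name__ == "__main__":
    cfg = yaml.safe_load(CONFIG)
    print(cfg["task"], "->", cfg["evaluator"]["type"])
```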
2. Supported Languages, Tasks, and Benchmarks
UltraEval-Audio covers 10 languages (English, Chinese, Russian, German, Japanese, French, Italian, Dutch, Polish, Portuguese) and organizes its 14 core task categories into three overarching groups: Audio Understanding, Audio Generation, and Audio Codec. The framework integrates 36 authoritative benchmarks, with careful representation of diverse domains (speech, music, environmental sounds, multimodal QA) (Shi et al., 4 Jan 2026).
| Task Group | Example Tasks | Benchmark Examples | Core Metrics |
|---|---|---|---|
| Audio Understanding | ASR, AST, Gender/Emotion/Music ID | LibriSpeech, NSynth, CoVoST 2 | WER, BLEU, ACC |
| Audio Generation | TTS, Voice Cloning, Speech QA | Long-TTS-Eval, Seed-TTS-Eval | WER, SIM, UTMOS |
| Audio Codec | Speech Codec evaluation | LibriSpeech, AISHELL-1 | WER, SIM, DNSMOS |
The evaluation suite includes major datasets such as LibriSpeech (English ASR), CommonVoice (multi-language ASR), GTZAN (music genre), DESED (environmental sound), AudioCaps and Clotho (audio captioning), and newly introduced Chinese resources such as SpeechCMMLU (Chinese general knowledge MCQ) and SpeechHSK (HSK-based listening comprehension) (Shi et al., 4 Jan 2026).
3. Model Integration and Inference Standardization
UltraEval-Audio supports direct integration of 24 mainstream AFMs, covering both proprietary APIs (e.g., GPT-4o-Realtime, Gemini-2.5-Flash) and open-source or local models (e.g., Qwen2.5-Omni, MiniCPM-o 2.6, ChatTTS-DVAE, EnCodec, SoundStream). For each model, a dedicated adapter implements a standardized interface, abstracting deployment differences and harmonizing input/output for subsequent evaluation.
Sample integration patterns include:
- Remote API: e.g., GPT-4o-Realtime via HTTP POST, returning both text and audio for analysis.
- Local Deployment: e.g., Qwen3-Omni-30B-A3B-Instruct via vLLM, streaming text/audio tokens and decoding via a local vocoder.
- Containerized: e.g., MiniCPM-o 2.6 under Docker with Torch 2.0, direct waveform I/O.
This uniformity ensures apples-to-apples comparisons across high-level metrics and facilitates reproducible benchmarking (Shi et al., 4 Jan 2026).
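To make the remote-API pattern concrete, the hedged sketch below posts a prompt and a base64-encoded waveform to a placeholder endpoint and maps the response back into the uniform text/audio shape; the URL, payload fields, and response keys are all hypothetical, not any provider's real API.

```python
# Hypothetical remote-API adapter; the endpoint URL, payload layout, and
# response keys are placeholders, not any provider's real API.
import base64

import requests


class RemoteAPIAdapter:
    def __init__(self, endpoint: str, api_key: str):
        self.endpoint = endpoint
        self.api_key = api_key

    def inference(self, waveform_bytes: bytes, prompt: str) -> dict:
        payload = {
            "prompt": prompt,
            "audio_b64": base64.b64encode(waveform_bytes).decode("ascii"),
        }
        resp = requests.post(
            self.endpoint,
            json=payload,
            headers={"Authorization": f"Bearer {self.api_key}"},
            timeout=120,
        )
        resp.raise_for_status()
        body = resp.json()
        # Harmonize into the same text/audio pair every evaluator expects.
        return {
            "text": body.get("text"),
            "audio": base64.b64decode(body["audio_b64"]) if body.get("audio_b64") else None,
        }
```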
4. Evaluation Protocols and Metrics
UltraEval-Audio employs a comprehensive set of evaluation strategies, incorporating existing practices from open-source tools like AquaTk (Vinay et al., 2023), domain-specific frameworks such as PodEval (Xiao et al., 1 Oct 2025), and the scalar/vector protocols of X-ARES (Zhang et al., 22 May 2025). Evaluation is performed at the task-category level with explicit metric associations:
- ASR/AST: WER (word error rate), CER (character error rate), BLEU (translation).
- Classification (emotion, gender, instrument, genre): Accuracy.
- TTS/VC: ASR-WER (for intelligibility), SIM (cosine similarity of speaker embeddings), UTMOS/DNSMOS (predicted MOS scales).
- Audio Captioning: BLEU, ROUGE-L (text similarity).
- Speech Codec: Triad of Semantic Accuracy (WER/CER), Timbre Fidelity (SIM), and Acoustic Quality (UTMOS, DNSMOS P.835/P.808, PESQ/STOI optional).
Metrics are computed per-benchmark and can be aggregated into leaderboards with radar visualizations in the case of codecs (Shi et al., 4 Jan 2026). The post-evaluation pipeline supports both batch and streaming operation and is configured by task via YAML.
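As a minimal illustration of the rule-based evaluators listed above, the snippet below scores ASR with corpus-level WER (via the jiwer library) and classification with exact-match accuracy; the text-normalization choices are assumptions, not UltraEval-Audio's documented defaults.

```python
# Illustrative rule-based scoring for ASR (WER) and classification (accuracy);
# normalization choices are assumptions, not UltraEval-Audio's defaults.
import jiwer  # pip install jiwer


def normalize(text: str) -> str:
    return " ".join(text.lower().split())


def score_asr(references: list, hypotheses: list) -> float:
    """Corpus-level word error rate."""
    return jiwer.wer(
        [normalize(r) for r in references],
        [normalize(h) for h in hypotheses],
    )


def score_classification(labels: list, predictions: list) -> float:
    """Exact-match accuracy for emotion/gender/genre style tasks."""
    correct = sum(l == p for l, p in zip(labels, predictions))
    return correct / len(labels)


if __name__ == "__main__":
    print(score_asr(["the cat sat"], ["the cat sat down"]))              # ~0.33
    print(score_classification(["happy", "sad"], ["happy", "angry"]))    # 0.5
```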
5. Specialized Codec Evaluation Scheme
UltraEval-Audio introduces a holistic speech codec evaluation methodology, addressing the gap in codec assessment (Shi et al., 4 Jan 2026). The protocol is defined as follows:
- Semantic Accuracy: Measured by WER/CER between reference and reconstructed audio, $\mathrm{WER} = (S + D + I)/N$, where $S$ = substitutions, $D$ = deletions, $I$ = insertions, and $N$ = number of reference words.
- Timbre Fidelity: Cosine similarity between speaker embeddings $e_{\mathrm{ref}}$ and $e_{\mathrm{rec}}$ extracted by a WavLM-large model fine-tuned for speaker verification: $\mathrm{SIM} = \dfrac{e_{\mathrm{ref}} \cdot e_{\mathrm{rec}}}{\lVert e_{\mathrm{ref}} \rVert\,\lVert e_{\mathrm{rec}} \rVert}$.
- Acoustic Quality: Objective non-intrusive metrics:
- UTMOS: 0–5 MOS range (utterance-level mean opinion score estimator)
- DNSMOS P.835/P.808: neural MOS proxies for speech/noise quality and overall distortion.
Supplementary metrics include PESQ [ITU-T P.862] and STOI, supporting fine-grained perceptual and intelligibility ranking.
Leaderboard and visualization tools combine these orthogonal metrics, enabling comprehensive codec comparison under a unified schema.
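A minimal sketch of how the triad could be assembled into a single per-codec report is shown below; the speaker embeddings and MOS predictions are assumed to come from upstream models (the speaker-verification encoder and UTMOS/DNSMOS predictors), and the function names are illustrative.

```python
# Sketch of a codec triad report; embeddings and MOS predictions are assumed
# to be produced upstream (speaker-verification encoder, UTMOS/DNSMOS predictors).
import numpy as np


def speaker_similarity(e_ref: np.ndarray, e_rec: np.ndarray) -> float:
    """Cosine similarity between reference and reconstructed speaker embeddings."""
    return float(np.dot(e_ref, e_rec) / (np.linalg.norm(e_ref) * np.linalg.norm(e_rec)))


def codec_report(wer: float, e_ref: np.ndarray, e_rec: np.ndarray,
                 utmos: float, dnsmos_ovrl: float) -> dict:
    """Bundle the three orthogonal axes for leaderboard/radar plotting."""
    return {
        "semantic_accuracy_wer": wer,                              # lower is better
        "timbre_fidelity_sim": speaker_similarity(e_ref, e_rec),   # higher is better
        "acoustic_quality": {"utmos": utmos, "dnsmos_ovrl": dnsmos_ovrl},
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    e = rng.normal(size=256)
    print(codec_report(wer=0.042, e_ref=e, e_rec=e + 0.01 * rng.normal(size=256),
                       utmos=3.9, dnsmos_ovrl=3.4))
```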
6. Multilingual and Chinese-Specific Benchmarks
To overcome the heavy Anglocentricity of prior AFM evaluation, UltraEval-Audio introduces SpeechCMMLU and SpeechHSK (Shi et al., 4 Jan 2026):
- SpeechCMMLU: A multiple-choice QA audio benchmark derived from the CMMLU text suite, filtered and synthesized via CosyVoice 2 TTS, with ASR-verification for transcription accuracy. Evaluation metric is exact-match accuracy over 3,519 Chinese MCQ samples.
- SpeechHSK: Based on official HSK listening exams (Levels 1–6), with 170 re-recorded question audios covering varied difficulty. Scoring is by answer accuracy.
Baselines on SpeechCMMLU (GPT-4o-Realtime: 70.05%; Qwen2.5-Omni: 73.72%) and SpeechHSK (GPT-4o-Realtime: 98.69% on answer selection) illustrate current model capacity and underscore the importance of these benchmarks for true cross-lingual capability assessment.
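The exact-match scoring used for these MCQ benchmarks can be sketched as follows; the option-letter extraction regex is an assumption, since the paper does not specify how answers are parsed from free-form model output.

```python
# Hypothetical exact-match scoring for MCQ benchmarks such as SpeechCMMLU/SpeechHSK;
# the option-letter extraction regex is an assumption about output post-processing.
import re
from typing import Optional


def extract_choice(model_output: str) -> Optional[str]:
    """Pull the first standalone option letter (A-D) from free-form output."""
    match = re.search(r"\b([A-D])\b", model_output.upper())
    return match.group(1) if match else None


def exact_match_accuracy(gold: list, outputs: list) -> float:
    predictions = [extract_choice(o) for o in outputs]
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


if __name__ == "__main__":
    # Gold answers B and C; the second output parses as "A", so accuracy is 0.5.
    print(exact_match_accuracy(["B", "C"], ["The answer is B.", "I think it's A"]))
```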
7. Extensibility, Usage, and Future Directions
UltraEval-Audio is designed for rapid extensibility:
- New tasks: Defined in YAML under configs/tasks/, specifying dataset, split, and prompt references. Custom metrics require subclassing a base evaluator.
- Languages and datasets: Added through simple configuration and extension of the data loader.
- Models: Registered by implementing .inference() adapters per model, requiring only light-touch editing of model and prompt registries.
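As an illustration of the custom-metric extension point above, a new evaluator might be registered along the following lines; BaseEvaluator, register_evaluator, and the registry itself are stand-ins for whatever base class and registry the framework actually exposes.

```python
# Hypothetical custom evaluator; BaseEvaluator and register_evaluator are
# illustrative stand-ins, not UltraEval-Audio's actual extension API.
from abc import ABC, abstractmethod

EVALUATOR_REGISTRY: dict = {}


def register_evaluator(name: str):
    def wrap(cls):
        EVALUATOR_REGISTRY[name] = cls
        return cls
    return wrap


class BaseEvaluator(ABC):
    @abstractmethod
    def score(self, reference: str, hypothesis: str) -> float:
        ...


@register_evaluator("char_overlap")
class CharOverlapEvaluator(BaseEvaluator):
    """Toy metric: fraction of reference characters present in the hypothesis."""

    def score(self, reference: str, hypothesis: str) -> float:
        ref_chars = set(reference)
        return len(ref_chars & set(hypothesis)) / max(len(ref_chars), 1)


if __name__ == "__main__":
    evaluator = EVALUATOR_REGISTRY["char_overlap"]()
    print(evaluator.score("hello world", "hello"))  # 0.5
```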
The public codebase, leaderboard, and documentation at https://github.com/OpenBMB/UltraEval-Audio (Shi et al., 4 Jan 2026) enable direct academic use, community contributions, and model/corpus expansion. The system is configuration- and template-driven, enforcing rigorous, reproducible, and transparent evaluation. A plausible implication is that future AFM evaluation standards may converge towards this declarative, modular paradigm, leveraging open-source infrastructures and multi-dimensional metrics as operationalized in UltraEval-Audio.
References
- UltraEval-Audio: A Unified Framework for Comprehensive Evaluation of Audio Foundation Models (Shi et al., 4 Jan 2026)
- AQUATK: An Audio Quality Assessment Toolkit (Vinay et al., 2023)
- X-ARES: A Comprehensive Framework for Assessing Audio Encoder Performance (Zhang et al., 22 May 2025)
- PodEval: A Multimodal Evaluation Framework for Podcast Audio Generation (Xiao et al., 1 Oct 2025)
- AudioEval: Automatic Dual-Perspective and Multi-Dimensional Evaluation of Text-to-Audio-Generation (Wang et al., 16 Oct 2025)
- AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation (Manakul et al., 17 Jul 2025)