VoiceAssistant-Eval: Speech AI Benchmark
- VoiceAssistant-Eval is a comprehensive benchmarking framework that evaluates AI assistants across listening, speaking, and viewing modalities under realistic, audio-only conditions.
- It organizes 10,497 curated examples into distinct task categories to assess content quality, speech quality, and text-speech consistency using standardized metrics.
- The framework enforces a strict voice-first protocol with zero-shot, audio-only prompting to drive quantitative comparisons among proprietary and open-source models.
VoiceAssistant-Eval is a benchmarking framework for evaluating speech-based AI assistants across input (listening), output (speaking), and perception (viewing) modalities. It was introduced to address gaps in previous benchmarks, which limited assessment to isolated speech recognition or dialogue performance and did not capture holistic capabilities such as voice imitation, hands-free situated interaction, safety alignment, or multimodal reasoning. VoiceAssistant-Eval enables systematic, quantitative analysis of speech-first assistant systems, both proprietary and open-source, under real-world, audio-only prompting and challenging agentic requirements (Wang et al., 26 Sep 2025).
1. Benchmark Scope and Structure
VoiceAssistant-Eval consists of 10,497 curated examples grouped into three main modalities:
- Listening (2,692 items, 25.6%): Evaluates audio-only comprehension of speech, music, and environmental sounds. Subtasks include general sound identification, speech language/gender ID, music genre/emotion judgment, and word/sound discrimination under realistic and diverse acoustic conditions.
- Speaking (6,905 items, 65.8%): Assesses hands-free spoken-response generation. Subtasks cover multi-turn dialogue, role-play (voice timbre/style imitation), emotional/empathetic responses, instruction-following, robust handling of noisy or out-of-distribution phrasing, and explicit safety (refusal/completion of disallowed requests).
- Viewing (900 items, 8.6%): Multimodal image + spoken question tasks (drawn from MMMU) require visual+auditory reasoning, with a focus on extracting structured answers from images and oral queries.
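As a quick consistency check on the split above, the following minimal sketch recomputes the total and the per-modality shares from the three reported counts. The names `COUNTS` and `modality_shares` are illustrative and not part of the benchmark's tooling.

```python
# Item counts per modality, as reported in the list above.
COUNTS = {"listening": 2692, "speaking": 6905, "viewing": 900}


def modality_shares(counts):
    """Return each modality's share of the total, in percent (one decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


TOTAL = sum(COUNTS.values())
SHARES = modality_shares(COUNTS)

if __name__ == "__main__":
    print(TOTAL)
    for name, share in SHARES.items():
        print(f"{name}: {share}%")
```

The three counts sum to the stated 10,497 examples, and the rounded shares match the percentages quoted for each modality.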
The benchmark features a strict "voice-first" evaluation protocol: all instructions and prompts are synthesized as speech (not text), enforcing end-to-end spoken comprehension and disallowing text-prior adaptation (Wang et al., 26 Sep 2025).
2. Task Categories and Dataset Design
The dataset is partitioned into 13 task categories across the three modalities (four listening, eight speaking, one viewing). Notable design characteristics include:
- Listening: Four main subtasks—general sounds, music, sound discrimination, and speech-specific Q&A.
- Speaking: Eight focused subtasks—open-domain assistant, emotional response, instruction following, multi-turn interaction, mathematical/civic reasoning, robustness under noise, role-play imitation (with explicit speaker similarity evaluation), and safety-critical refusal.
- Viewing: Joint audio-visual reasoning, where spoken queries relate to diverse image types including diagrams, charts, and technical schematics.
Each subtask is grounded in realistic use-case scenarios, spans multiple languages, and varies the acoustic and visual context, which supports both quantitative and qualitative error analysis.
3. Evaluation Metrics and Methodology
VoiceAssistant-Eval employs a triadic evaluation framework per item:
- Content Quality (C): Measured by a GPT-judge (gpt-oss-20b), which compares the model's response text to a reference, giving a "Good/Correct" score or soft-graded value in [0,1].
- Speech Quality (S): Assessed via UTMOS, a learned predictor of mean opinion score (MOS), and normalized as $S = (\mathrm{UTMOS} - 1)/4$, which maps the [1,5] MOS scale to [0,1].
- Text–Speech Consistency (K): Computed as $K = 1 - \mathrm{WER}'$, where WER' is a modified word error rate between the model's returned text and the ASR transcript of its generated speech (using Whisper-Large-v3). Special cases handle very short outputs.
The final score for each item combines C, S, and K into a single value, reported as an interpretable percentage (Wang et al., 26 Sep 2025).
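The three components can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark's implementation: the MOS is rescaled linearly from [1,5] to [0,1], consistency is taken as one minus a plain word error rate clipped at 1 (the benchmark's modified WER' and its short-output special cases are not reproduced), and the three values are combined as a product, which is an assumption made here for illustration. All function names are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Plain word-level WER: edit distance divided by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # Row-by-row dynamic programme for Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def normalize_mos(mos: float) -> float:
    """Linearly map a MOS in [1, 5] to [0, 1], clipping out-of-range input."""
    return min(1.0, max(0.0, (mos - 1.0) / 4.0))


def consistency(text: str, asr_transcript: str) -> float:
    """Assumed form K = 1 - min(WER, 1); not the benchmark's exact WER'."""
    return 1.0 - min(1.0, word_error_rate(text, asr_transcript))


def item_score(content: float, mos: float, text: str, asr_transcript: str) -> float:
    """Assumed combination: product of C, S, and K, as a percentage."""
    return 100.0 * content * normalize_mos(mos) * consistency(text, asr_transcript)


if __name__ == "__main__":
    # C = 1.0, MOS = 4.0 (S = 0.75), one substitution in four words (K = 0.75).
    print(item_score(1.0, 4.0, "turn on the lights", "turn on the light"))
```

Under a multiplicative combination, a response that is correct in content but poorly spoken, or whose audio diverges from its text, is penalized on the final score; other combination rules would weigh the components differently.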
Role-play imitation tasks additionally use Wespeaker similarity (embedding-based timbre match) directly in the scoring.
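An embedding-based timbre match is commonly computed as the cosine similarity between two speaker embeddings. The sketch below assumes that form and uses toy vectors; it does not call Wespeaker, and how the benchmark folds the similarity into the role-play score is not specified here. The name `cosine_similarity` is illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as "no match"
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    reference_voice = [0.2, 0.7, 0.1]   # toy embedding of the target speaker
    generated_voice = [0.25, 0.6, 0.2]  # toy embedding of the model's output
    print(round(cosine_similarity(reference_voice, generated_voice), 3))
```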
4. Experimental Protocol and Models Benchmarked
The evaluation protocol mandates zero-shot, audio-only prompting. All prompts are synthesized and played; no textual instructions are provided to models. Benchmarking uses a standardized inference setup (beam size, temperature, max tokens).
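One way to enforce such a protocol in an evaluation harness is to make text prompts and few-shot examples unrepresentable in the request. The sketch below is hypothetical throughout: the class and function names are invented for illustration, and the decoding values are placeholders, since the actual beam size, temperature, and token limit are not given here.

```python
from dataclasses import dataclass
from typing import Optional

AUDIO_SUFFIXES = (".wav", ".flac", ".mp3")


@dataclass(frozen=True)
class DecodingConfig:
    """Decoding settings shared by every model; values are placeholders."""
    beam_size: int = 1
    temperature: float = 0.0
    max_new_tokens: int = 512


@dataclass(frozen=True)
class AudioPrompt:
    """A spoken prompt; there is deliberately no text field."""
    audio_path: str
    image_path: Optional[str] = None  # only used for Viewing items


def build_request(prompt, config, few_shot_examples=()):
    """Build a model request, rejecting anything outside the protocol."""
    if few_shot_examples:
        raise ValueError("zero-shot protocol: in-context examples are not allowed")
    if not prompt.audio_path.lower().endswith(AUDIO_SUFFIXES):
        raise ValueError("prompt must be an audio file")
    return {"audio": prompt.audio_path, "image": prompt.image_path, "decoding": config}


if __name__ == "__main__":
    shared = DecodingConfig()
    print(build_request(AudioPrompt("item_0001.wav"), shared))
```

Freezing the configuration and reusing one instance across models keeps decoding settings from drifting between runs, which is the point of a standardized setup.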
Twenty-one open-source models and GPT-4o-Audio were evaluated, including:
- Small models (<4B): mini-omni, Baichuan-Omni 1.5B, etc.
- Mid-size (4–10B): Step-Audio-2-mini (7B), Qwen2.5-Omni 7B, Llama-3.1-8B-Omni, Kimi-Audio 7B.
- Large (10–130B): Step-Audio 130B, LLaMA-Omni2-32B.
- Proprietary: GPT-4o-Audio.
This enables robust comparative assessment of model scaling, architecture, and backbone design trade-offs (Wang et al., 26 Sep 2025).
5. Key Results and Comparative Analysis
The following table summarizes characteristic findings:
| Task Domain | Top Model(s) | Score (%) | Notable Findings |
|---|---|---|---|
| Listening | Step-Audio-2-mini (7B) | 40.06 | Outperforms LLaMA-Omni2-32B (16.00%) despite much smaller size. |
| Speaking | Kimi-Audio 7B, GPT-4o-Audio | ~36–62 | The proprietary model excels on speaking and reasoning but is not uniformly better. |
| Viewing | Qwen2.5-Omni 7B (text-prompt), Qwen2.5-Omni 7B (voice-prompt) | 59.2 → 42.9 | All models degrade 10–16 points when moved from text to spoken vision Q&A. |
| Safety | Freeze-Omni | 79.8 | High safety/robustness, outperforming many other open-source models. |
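The gaps quoted in the table follow directly from the reported scores. The short sketch below recomputes them; only the four numbers come from the table, and the helper name `point_gap` is illustrative.

```python
def point_gap(higher: float, lower: float) -> float:
    """Difference between two percentage scores, in points (two decimals)."""
    return round(higher - lower, 2)


# Viewing: Qwen2.5-Omni 7B with a text prompt vs. a spoken prompt.
VIEWING_DROP = point_gap(59.2, 42.9)

# Listening: Step-Audio-2-mini (7B) vs. LLaMA-Omni2-32B.
LISTENING_GAP = point_gap(40.06, 16.00)

if __name__ == "__main__":
    print(f"Viewing drop, text to voice: {VIEWING_DROP} points")
    print(f"Listening gap, 7B over 32B: {LISTENING_GAP} points")
```

The Viewing drop of about 16 points sits at the upper end of the 10–16 point range noted in the table, and the Listening gap shows the 7B model scoring roughly 2.5 times the 32B model.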
Key observations:
- The proprietary model (GPT-4o-Audio) does not dominate: in four out of thirteen tasks, open-source models outperform the proprietary baseline.
- Speaking tasks are generally easier for all models; the average speaking subtask score is ∼36% vs. ∼24% for listening.
- Role-play tasks highlight a trade-off between content fidelity and speaker similarity; most models excel at one or the other but not both.
- Model size alone does not determine listening performance; specialized audio-encoder architectures (Step-Audio family) close or surpass gaps to large LLM