VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing

Published 26 Sep 2025 in cs.CL, cs.AI, cs.CV, cs.HC, and cs.SD | arXiv:2509.22651v1

Abstract: The growing capabilities of LLMs and multimodal systems have spurred interest in voice-first AI assistants, yet existing benchmarks are inadequate for evaluating the full range of these systems' capabilities. We introduce VoiceAssistant-Eval, a comprehensive benchmark designed to assess AI assistants across listening, speaking, and viewing. VoiceAssistant-Eval comprises 10,497 curated examples spanning 13 task categories. These tasks include natural sounds, music, and spoken dialogue for listening; multi-turn dialogue, role-play imitation, and various scenarios for speaking; and highly heterogeneous images for viewing. To demonstrate its utility, we evaluate 21 open-source models and GPT-4o-Audio, measuring the quality of the response content and speech, as well as their consistency. The results reveal three key findings: (1) proprietary models do not universally outperform open-source models; (2) most models excel at speaking tasks but lag in audio understanding; and (3) well-designed smaller models can rival much larger ones. Notably, the mid-sized Step-Audio-2-mini (7B) achieves more than double the listening accuracy of LLaMA-Omni2-32B-Bilingual. However, challenges remain: multimodal (audio plus visual) input and role-play voice imitation tasks are difficult for current models, and significant gaps persist in robustness and safety alignment. VoiceAssistant-Eval identifies these gaps and establishes a rigorous framework for evaluating and guiding the development of next-generation AI assistants. Code and data will be released at https://mathllm.github.io/VoiceAssistantEval/ .

Summary

  • The paper introduces VoiceAssistant-Eval, a benchmark with 10,497 QA items across listening, speaking, and viewing tasks.
  • It employs triadic evaluation metrics combining content quality, speech naturalness, and text-speech consistency, validated against human judgments.
  • The analysis reveals that mid-sized, specialized models can outperform larger ones, though challenges persist in audio-visual integration and safety alignment.

Comprehensive Benchmarking of Voice-First AI Assistants: The VoiceAssistant-Eval Framework

Motivation and Benchmark Design

VoiceAssistant-Eval addresses critical gaps in the evaluation of multimodal AI assistants, specifically those designed for voice-first interaction. Existing benchmarks are limited by their focus on text-based instructions, lack of voice personalization, insufficient coverage of real-world audio contexts, and inadequate assessment of multimodal (audio+vision) integration. VoiceAssistant-Eval introduces a large-scale, rigorously curated benchmark comprising 10,497 QA items across 13 task categories, spanning listening (audio, speech, music), speaking (dialogue, role-play, emotion, safety), and viewing (image-based reasoning with audio/text queries).

The benchmark is constructed from 47 diverse datasets, with careful attention to data quality, balance, and scenario realism. Speech instructions are synthesized using advanced TTS models (F5TTS, ChatTTS, Dia-1.6B), and only high-fidelity audio (UTMOS > 3.8, minimal WER via Whisper) is retained. The benchmark explicitly tests four previously underexplored capabilities: personalized voice imitation, hands-free audio interaction, multimodal audio-visual reasoning, and audio QA under complex contexts.

Figure 1: (a) Scores of six prominent omni-models across 13 tasks. (b) Examples from four newly designed tasks for voice assistants: I. Role-play with reference audio; II. Voice-based multi-turn conversation; III. Vision+audio integration; IV. Audio question with music context.
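As a concrete illustration of the filtering step described above, the sketch below gates each synthesized instruction on naturalness (UTMOS) and intelligibility (Whisper-based WER). It is not the authors' released pipeline: the `utmos_score` helper is a placeholder for any UTMOS predictor, and the 5% WER ceiling is an illustrative assumption.

```python
# Minimal sketch of the audio-quality gate, assuming a UTMOS predictor is
# available behind `utmos_score` and an illustrative 5% WER ceiling.
import whisper          # openai-whisper, used here to re-transcribe the TTS output
from jiwer import wer   # standard word-error-rate metric

asr_model = whisper.load_model("base")

def utmos_score(wav_path: str) -> float:
    """Placeholder for a UTMOS MOS predictor; plug in your own model here."""
    raise NotImplementedError

def keep_sample(wav_path: str, reference_text: str,
                utmos_min: float = 3.8, wer_max: float = 0.05) -> bool:
    """Retain a synthesized spoken instruction only if it is natural and intelligible."""
    if utmos_score(wav_path) <= utmos_min:                  # naturalness gate (UTMOS > 3.8)
        return False
    hypothesis = asr_model.transcribe(wav_path)["text"]     # ASR check of the synthesized audio
    return wer(reference_text.lower(), hypothesis.lower()) <= wer_max  # intelligibility gate
```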

Evaluation Protocols and Metrics

VoiceAssistant-Eval employs a triadic evaluation system, scoring model outputs on content quality (via gpt-oss-20b with task-specific prompts), speech quality (UTMOS), and text-speech consistency (modified WER). For role-play, speaker similarity is measured using Wespeaker embeddings. The final task score is the product of these metrics, normalized to percentage accuracy. This protocol enables integrated, modality-aware assessment, in contrast to prior benchmarks that report metrics independently.
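The composition of the per-task score can be sketched as follows. Only the multiplicative combination and the Wespeaker-style speaker similarity follow the description above; the judge prompts, rescaling choices (e.g., UTMOS/5, 1 − WER), and exact normalization are assumptions for illustration.

```python
# Illustrative composition of the triadic score; the paper's exact judge
# prompts and normalization may differ. Inputs are assumed to be rescaled
# to [0, 1] beforehand (e.g., UTMOS/5, 1 - WER, judge score / full marks).
import numpy as np

def task_score(content: float, speech: float, consistency: float) -> float:
    """Multiply the three per-response metrics and report a percentage."""
    for v in (content, speech, consistency):
        assert 0.0 <= v <= 1.0, "each component must be normalized to [0, 1]"
    return 100.0 * content * speech * consistency

def speaker_similarity(emb_ref: np.ndarray, emb_hyp: np.ndarray) -> float:
    """Cosine similarity between speaker embeddings (e.g., from Wespeaker), used for role-play."""
    return float(np.dot(emb_ref, emb_hyp) /
                 (np.linalg.norm(emb_ref) * np.linalg.norm(emb_hyp)))

# Example: strong content, good speech naturalness, near-perfect consistency.
print(task_score(content=0.85, speech=0.78, consistency=0.96))  # -> ~63.6
```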

Automated evaluation is validated against human judgments, with Pearson correlations exceeding 0.9 and agreement rates above 96% across all tasks, confirming reliability. Stability analysis over repeated runs yields narrow IQRs and low variance, demonstrating reproducibility.

Figure 2: Stability of automated evaluation across repeated runs. Boxplots show Qwen2.5-Omni-7B’s scores for each task over ten runs; narrow IQRs confirm repeatability.
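A minimal sketch of these reliability checks is given below, under assumed definitions: "agreement" is taken to mean matching binary correct/incorrect judgments (the paper's exact definition may differ), and per-task stability is summarized by the inter-quartile range across repeated runs, as in Figure 2.

```python
# Sketch of the reported reliability checks, assuming per-example automated
# and human scores, binary correctness labels, and a (runs x tasks) matrix
# of scores from repeated evaluation runs.
import numpy as np
from scipy.stats import pearsonr

def human_alignment(auto_scores, human_scores, auto_labels, human_labels):
    """Correlation and label agreement between automated and human evaluation."""
    r, _ = pearsonr(auto_scores, human_scores)                  # paper reports r > 0.9
    agree = float(np.mean(np.asarray(auto_labels) == np.asarray(human_labels)))
    return r, agree                                             # paper reports > 96% agreement

def run_iqr(scores_by_run: np.ndarray) -> np.ndarray:
    """scores_by_run has shape (n_runs, n_tasks); narrow IQRs indicate repeatability."""
    q1, q3 = np.percentile(scores_by_run, [25, 75], axis=0)
    return q3 - q1
```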

Model Performance Analysis

Twenty-one open-source models and GPT-4o-Audio are evaluated. Key findings include:

  • Proprietary models do not universally outperform open-source models. GPT-4o-Audio fails to surpass open-source models in 4/13 tasks, notably in Listening Sound and Listening Speech, where Step-Audio-2-mini achieves higher accuracy.
  • Speaking tasks are easier than listening tasks. 20/22 models score higher on speaking than listening, indicating a persistent bottleneck in audio understanding, especially for non-speech audio and music.
  • Smaller, well-designed models can rival larger ones. Step-Audio-2-mini (7B) more than doubles the listening accuracy of LLaMA-Omni2-32B-Bilingual (40.06 vs. 16.00), and Qwen2.5-Omni-7B achieves competitive overall scores.
  • Role-play and multimodal integration remain challenging. Step-Audio achieves the highest content and speaker similarity in role-play but low speech naturalness, while Qwen2.5-Omni-7B exhibits a 16.3-point drop in accuracy for image+audio queries compared to image+text.
  • Safety alignment and robustness are inconsistent. Some models (e.g., the Moshika family) score below 1 in robustness and below 28 in safety, while Freeze-Omni achieves 79.8 in safety, highlighting the impact of explicit alignment training.

Figure 3: Accuracy of multi-modal models on identical questions across two modalities. All models perform substantially worse when queries are spoken rather than written, illustrating the gap in robust audio-visual integration.

Error Analysis and Model Limitations

A detailed error analysis of Qwen2.5-Omni-7B reveals:

  • Listening errors are dominated by context loss (46%), speech perception (16%), and sound perception (15%), indicating limited audio memory and basic recognition failures.
  • Speaking errors are mainly insufficient answers (25%) and requirement deviation (23%), with notable failures in maintaining role-play style (13%).
  • Viewing errors are primarily vision perception errors (50%), followed by knowledge (19%) and reasoning (15%) errors, and context loss (12%).

Figure 4: Error analysis of Qwen2.5-Omni-7B across listening, speaking, and viewing tasks.

Implications and Future Directions

VoiceAssistant-Eval establishes a rigorous, reproducible framework for benchmarking voice-first AI assistants. The results demonstrate that current models are proficient in speech generation and simple dialogue but lag in audio understanding and multimodal reasoning. The strong performance of mid-sized, specialized models suggests that architectural and data-centric improvements can yield substantial gains without scaling parameter count.

Persistent weaknesses in audio-visual integration, role-play fidelity, and safety alignment highlight the need for:

  • Enhanced audio encoders and memory mechanisms for robust listening.
  • Integrated multimodal training to bridge the gap between spoken and written queries in visual reasoning.
  • Refined alignment and robustness protocols to ensure safe, reliable deployment in real-world scenarios.

The benchmark’s comprehensive coverage and validated metrics provide a foundation for transparent, longitudinal assessment of progress in voice-enabled AI. Future work should expand linguistic diversity, task realism (e.g., streaming, interactive evaluation), and coverage of dynamic scenarios (e.g., video-audio integration).

Conclusion

VoiceAssistant-Eval delivers a comprehensive, modality-integrated benchmark for evaluating AI assistants across listening, speaking, and viewing. The analysis reveals that while current models excel in speech generation and basic dialogue, they are limited in audio understanding and multimodal integration. The benchmark’s findings inform the design of next-generation assistants, emphasizing the importance of balanced development across modalities, targeted model specialization, and robust safety alignment. VoiceAssistant-Eval will serve as a critical resource for tracking and accelerating progress toward truly expert, voice-first AI systems.
