
VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models (2501.04962v1)

Published 9 Jan 2025 in cs.CL, cs.SD, and eess.AS

Abstract: With the growing demand for developing speech-based interaction models, end-to-end Spoken Language Models (SLMs) have emerged as a promising solution. When engaging in conversations with humans, it is essential for these models to comprehend a wide range of world knowledge. In this paper, we introduce VoxEval, a novel speech question-answering benchmark specifically designed to assess SLMs' knowledge understanding through purely speech-based interactions. Unlike existing AudioQA benchmarks, VoxEval maintains speech format for both questions and answers, evaluates model robustness across diverse audio conditions (varying timbres, audio qualities, and speaking styles), and pioneers the assessment of challenging domains like mathematical problem-solving in spoken format. Our comprehensive evaluation of recent SLMs using VoxEval reveals significant performance limitations in current models, highlighting crucial areas for future improvements.

Authors (4)
  1. Wenqian Cui (7 papers)
  2. Xiaoqi Jiao (8 papers)
  3. Ziqiao Meng (12 papers)
  4. Irwin King (170 papers)