VoiceBench: Benchmarking Voice Assistants

Updated 1 July 2026

VoiceBench is a multi-faceted benchmark suite that systematically assesses the performance and robustness of speech-enabled LLM agents under diverse acoustic, accented, and content-perturbed conditions.
It integrates both real and synthetic audio samples to evaluate instruction-following, reasoning, comprehension, and safety compliance of voice assistants.
Using comprehensive metrics such as accuracy scores, WER, and safety refusal rates, VoiceBench informs practical improvements in multi-modal AI systems.

VoiceBench is a multi-faceted benchmark suite designed to systematically evaluate the capabilities and limitations of LLM-based voice assistants under realistic, diverse, and challenging speech interaction scenarios. Prior evaluation suites focus predominantly on ASR accuracy or on QA over clean speech; VoiceBench instead assesses the downstream understanding, instruction-following, reasoning, and safety compliance of speech-enabled agents across a spectrum of user, environmental, and content-driven perturbations. It incorporates both real and synthetic spoken instructions reflecting variations in speaker characteristics, environmental conditions, and content phenomena, with the aim of comprehensive, end-to-end voice assistant benchmarking (Chen et al., 2024).

1. Motivation, Scope, and Design Principles

VoiceBench is motivated by the observation that most existing benchmarks (such as LibriSpeech, CommonVoice, IndicSUPERB, and AudioBench) emphasize isolated capabilities like transcription or QA, neglecting real-world variabilities critical to robust speech interaction: diverse speaker demographics, variable acoustic environments, and content-related complexities including disfluency and adversarial queries. VoiceBench aims to quantify not just linguistic competence but the practical utility and resilience of voice assistants in authentic deployment contexts (Chen et al., 2024, Jain et al., 9 Oct 2025).

VoiceBench is therefore structured to address three primary axes of variation:

Speaker Characteristics: Age (children, adults, seniors), regional and international accents (e.g., Indian, Australian, Kenyan), speaking rate and pitch diversity.
Environmental Factors: Far-field attenuation, reverberation, background and babble noise, packet loss, signal distortion.
Content Variability: Disfluencies, mispronunciations, grammar errors, code-switching, and adversarial input.

This multipronged design is instantiated through the inclusion of both real-world and TTS-synthesized speech samples, with high-fidelity normalization and cross-system validation (Chen et al., 2024).
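The environmental axis can be made concrete with a small sketch. The code below is illustrative only and is not taken from the VoiceBench pipeline: the function names `add_noise_at_snr` and `drop_packets`, the 16 kHz sample rate, the 320-sample frame length, and the use of white Gaussian noise are all assumptions chosen for the example, and a sine tone stands in for a speech waveform.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, rng):
    """Add white Gaussian noise so the result has roughly the requested SNR in dB."""
    if not signal:
        raise ValueError("signal must not be empty")
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR(dB) = 10 * log10(signal_power / noise_power), solved for noise_power.
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def drop_packets(signal, frame_len, loss_prob, rng):
    """Zero out whole frames with probability loss_prob to mimic packet loss."""
    out = list(signal)
    for start in range(0, len(out), frame_len):
        if rng.random() < loss_prob:
            end = min(start + frame_len, len(out))
            out[start:end] = [0.0] * (end - start)
    return out


rng = random.Random(0)  # fixed seed so the example is repeatable
# One second of a 440 Hz tone at an assumed 16 kHz sample rate.
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
noisy = add_noise_at_snr(clean, snr_db=10, rng=rng)
lossy = drop_packets(clean, frame_len=320, loss_prob=0.1, rng=rng)
```

Far-field attenuation and reverberation would need a room impulse response and are left out of this sketch.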

2. Benchmark Construction and Dataset Composition

VoiceBench comprises a heterogeneous set of instruction subsets totaling 5,783 samples in its initial public release (Chen et al., 2024). That total matches the first seven rows of the table below; the eighth row, AlpacaEval*, is a 199-sample subset used for ablation. Later works extend the setup, e.g., VoiceAgentBench contributes 5,757 synthetic spoken queries for multilingual agentic evaluation (Jain et al., 9 Oct 2025). The dataset is composed as follows:

Subset	Speech Type	# Instances	Content Focus
AlpacaEval	Synthetic	636	Instruction following, open-ended QA (TTS)
CommonEval	Real	200	Real user queries from CommonVoice
SD-QA	Real	553	Varied-accent spoken QA (TyDi-QA etc.)
OpenBookQA	Synthetic	455	Science multiple-choice QA (TTS)
MMSU	Synthetic	3,074	Multi-task semantic understanding
IFEval	Synthetic	345	Explicit instruction following
AdvBench	Synthetic	520	Adversarial/safety evaluation
AlpacaEval*	Synthetic	199	Subset for ablation

Approximately 13% of VoiceBench’s samples are real human speech (the CommonEval and SD-QA subsets, 753 of 5,783); the remainder is TTS-generated with accent, pitch, volume, and noise manipulations. Cross-validation with alternative TTS systems (e.g., CosyVoice, MeloTTS) is used to check that model rankings remain consistent (Chen et al., 2024, Jain et al., 9 Oct 2025).
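The counts in the composition table can be cross-checked with a few lines of arithmetic. The sketch below copies the table rows as given; treating the starred AlpacaEval* row as excluded from the headline total is an assumption made here because it is the reading under which the numbers agree.

```python
# (subset, speech type, instance count), copied from the composition table.
subsets = [
    ("AlpacaEval", "Synthetic", 636),
    ("CommonEval", "Real", 200),
    ("SD-QA", "Real", 553),
    ("OpenBookQA", "Synthetic", 455),
    ("MMSU", "Synthetic", 3074),
    ("IFEval", "Synthetic", 345),
    ("AdvBench", "Synthetic", 520),
    ("AlpacaEval*", "Synthetic", 199),  # ablation subset
]

total_all_rows = sum(count for _, _, count in subsets)
# Assumption: the starred ablation row is not part of the headline total.
total_core = sum(count for name, _, count in subsets if not name.endswith("*"))
real_count = sum(count for _, kind, count in subsets if kind == "Real")
real_share = 100.0 * real_count / total_core

print(total_all_rows, total_core, real_count, round(real_share, 2))
# prints: 5982 5783 753 13.02
```

Under this reading, both the 5,783 total and the roughly 13% real-speech share are consistent with the table.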

3. Evaluation Protocol and Metrics

VoiceBench employs a standardized evaluation protocol targeting downstream functional performance, not just transcription accuracy. After preprocessing, systems process spoken instructions, producing text responses evaluated by task-specific metrics (Chen et al., 2024):

Comprehension Accuracy (QA tasks):

$\mathrm{Acc} = \frac{\#\text{correct responses}}{\#\text{total tasks}} \times 100\%$

Used for SD-QA, MMSU, OpenBookQA, BBH, WildVoice.

Instruction-Following Accuracy (IFEval): Accuracy reported under both strict and loose adherence criteria.
Word Error Rate (WER):

$\mathrm{WER} = \frac{S + D + I}{N}$

Where $S$ = substitutions, $D$ = deletions, $I$ = insertions, and $N$ = total reference words; used principally as a diagnostic for systems that include an ASR stage.
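As a worked illustration of the WER formula, the sketch below computes $S + D + I$ as the word-level Levenshtein distance between a reference and a hypothesis transcript. It assumes whitespace tokenization with no text normalization; the function name `word_error_rate` is chosen for this example, and production scoring tools typically normalize case and punctuation first.

```python
def word_error_rate(reference, hypothesis):
    """Return (S + D + I) / N using word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# One deletion ("the") and one substitution ("lights" -> "light") over 5 words.
wer = word_error_rate("turn on the kitchen lights", "turn on kitchen light")
print(wer)  # prints: 0.4
```

Because insertions are counted against the reference length $N$, WER can exceed 1 when the hypothesis is much longer than the reference.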

Refusal Rate (AdvBench, Safety):

$\mathrm{RefusalRate} = \frac{\#\text{safe refusals}}{\#\text{adversarial inputs}}\times100\%$
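A minimal sketch of the refusal-rate computation follows. Deciding whether a response counts as a safe refusal is the hard part; the keyword heuristic and the marker list `REFUSAL_MARKERS` below are hypothetical stand-ins for whatever classifier or judge an evaluation actually uses, and are not taken from the source.

```python
# Hypothetical refusal markers, for illustration only.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am sorry", "i won't")


def is_refusal(response):
    """Heuristic check: does the response contain a known refusal phrase?"""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses):
    """Percentage of responses to adversarial inputs judged to be refusals."""
    if not responses:
        raise ValueError("need at least one response")
    refused = sum(is_refusal(r) for r in responses)
    return 100.0 * refused / len(responses)


responses = [
    "I'm sorry, but I can't help with that.",
    "Sure, here is how to do it.",
    "I cannot assist with this request.",
    "Step 1: gather the materials.",
]
rate = refusal_rate(responses)
print(rate)  # prints: 50.0
```

A keyword heuristic can miss polite refusals phrased differently and can flag compliant answers that merely apologize, so its output is best treated as approximate.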

LLM-based Scoring (AlpacaEval, CommonEval, WildVoice): Open-ended responses are judged by a reference LLM (e.g., GPT-4o) on a 1–5 scale, rescaled for aggregation:

$r'_i = (r_i - 1) \times 25$

where $r_i \in [1,5]$, so that rescaled scores lie in $[0,100]$.
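The rescaling is a linear map from the judge's 1–5 scale onto 0–100, which the following sketch applies to a handful of made-up ratings. The function name `rescale_judge_score` and the example ratings are illustrative assumptions.

```python
def rescale_judge_score(r):
    """Map a judge rating r in [1, 5] onto [0, 100] via (r - 1) * 25."""
    if not 1 <= r <= 5:
        raise ValueError("judge ratings are expected to lie in [1, 5]")
    return (r - 1) * 25


ratings = [1, 3, 5, 4]  # made-up judge ratings for four responses
rescaled = [rescale_judge_score(r) for r in ratings]
mean_rescaled = sum(rescaled) / len(rescaled)
print(rescaled, mean_rescaled)  # prints: [0, 50, 100, 75] 56.25
```

Since the map is linear, averaging the raw ratings first and rescaling afterwards gives the same result (a raw mean of 3.25 also maps to 56.25).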

Composite Score: In standard leaderboards, the VoiceBench overall score is often computed as the mean or weighted mean of constituent subtask scores after normalization to [0,100].
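A sketch of the mean and weighted-mean aggregation is given below. The subtask scores and weights are invented for illustration, and the helper name `composite_score` is an assumption; the source does not prescribe a particular weighting.

```python
def composite_score(scores, weights=None):
    """Mean (or weighted mean) of per-subtask scores already normalized to [0, 100]."""
    if not scores:
        raise ValueError("need at least one subtask score")
    if weights is None:
        weights = {name: 1.0 for name in scores}  # plain mean
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Invented subtask scores, for illustration only.
scores = {"SD-QA": 80.0, "OpenBookQA": 60.0, "AdvBench": 100.0}
plain = composite_score(scores)
weighted = composite_score(scores, {"SD-QA": 1.0, "OpenBookQA": 3.0, "AdvBench": 0.0})
print(plain, weighted)  # prints: 80.0 65.0
```

The two results differ by 15 points on the same inputs, which shows why leaderboards that choose different aggregation rules are not directly comparable.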

These metrics enable joint assessment of factual QA, conversational fluency, safety robustness, and sensitivity to real-world distortions (Chen et al., 2024, Wang et al., 9 Jun 2026).

4. Comparative Results, Insights, and Analysis

VoiceBench experiments consistently reveal a substantial performance gap between text-based and speech-based agent inference (Chen et al., 2024, Hsu et al., 2 Mar 2026, Lu et al., 3 Jul 2025). Pipeline architectures (Whisper-v3 followed by an LLM such as LLaMA-3-8B) currently outperform open-source end-to-end SpeechLM baselines by more than 20 percentage points on spoken input, and GPT-4o-Audio achieves top-line results on most subtasks. For instance, average speech-prompt performance for representative systems (Chen et al., 2024, Table 2):

Model	Overall (%)
Naive-4o	89.5
DiVA	81.1
LLaMA-Omni (8B)	74.3
Qwen2-Audio (7B)	64.6
Mini-Omni2 (0.5B)	41.1
Moshi	27.5

Robustness analyses show that pipelines maintain accuracy across speaking rate (0.25×–2.0×), pitch, and accent perturbations, whereas end-to-end models are disproportionately fragile, particularly to child speech, low-resource accents, and environmental distortions (e.g., packet loss, far-field, reverberation). Content variabilities, especially mispronunciations and disfluencies, cause a drop of more than 20% in comprehension for open models (Chen et al., 2024).
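A percentage drop can be read as an absolute change in percentage points or as a change relative to the clean score, and the passage above does not say which is meant. The sketch below computes both for a made-up pair of scores; the function names and numbers are illustrative assumptions.

```python
def absolute_drop(clean_score, perturbed_score):
    """Drop in percentage points between clean and perturbed accuracy."""
    return clean_score - perturbed_score


def relative_drop(clean_score, perturbed_score):
    """Drop as a percentage of the clean accuracy."""
    if clean_score <= 0:
        raise ValueError("clean score must be positive")
    return 100.0 * (clean_score - perturbed_score) / clean_score


# Made-up accuracies (in %) on clean and disfluent versions of the same prompts.
clean_acc, perturbed_acc = 70.0, 50.0
points = absolute_drop(clean_acc, perturbed_acc)
percent = relative_drop(clean_acc, perturbed_acc)
print(points, round(percent, 2))  # prints: 20.0 28.57
```

Reporting both figures avoids ambiguity when comparing models whose clean-speech baselines differ.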

Safety (measured as refusal to follow adversarial instructions) remains a key weakness for end-to-end approaches, with modality-specific vulnerabilities: some models refuse a prompt in text but comply with the spoken version of the same prompt (Chen et al., 2024, Shao et al., 31 Dec 2025).
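One way to quantify this modality-specific vulnerability is to compare per-prompt refusal decisions for the text and spoken versions of the same adversarial prompts. The sketch below assumes those decisions are already available as booleans; the function name `modality_safety_gap` and the example data are hypothetical.

```python
def modality_safety_gap(text_refused, speech_refused):
    """Compare per-prompt refusal flags for text and spoken versions of the same prompts."""
    if len(text_refused) != len(speech_refused) or not text_refused:
        raise ValueError("need two non-empty lists of equal length")
    n = len(text_refused)
    # Prompts the model refuses in text but complies with when spoken.
    flipped = [
        i for i, (t, s) in enumerate(zip(text_refused, speech_refused))
        if t and not s
    ]
    return {
        "text_refusal_rate": 100.0 * sum(text_refused) / n,
        "speech_refusal_rate": 100.0 * sum(speech_refused) / n,
        "flipped_indices": flipped,
    }


# Made-up refusal flags for four adversarial prompts.
gap = modality_safety_gap(
    text_refused=[True, True, True, False],
    speech_refused=[True, False, False, False],
)
print(gap["text_refusal_rate"], gap["speech_refusal_rate"], gap["flipped_indices"])
# prints: 75.0 25.0 [1, 2]
```

Listing the flipped prompts, rather than only the two aggregate rates, makes it possible to inspect which inputs lose their safety behavior under speech.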

Extensions such as VoiceAgentBench (Jain et al., 9 Oct 2025) and τ-Voice (Ray et al., 14 Mar 2026) further expose deficiencies in multi-turn tool orchestration, agentic reasoning, and real-world dialogue dynamics. These evaluations emphasize the acute drop in parameter-filling accuracy (PF) for dependent tool workflows, especially in non-English settings: PF falls to ~15% for sequential calls with Whisper→LLaMA-70B and to near zero for current E2E SpeechLMs.

5. Theoretical and Methodological Implications

VoiceBench systematically exposes the "modality gap"—the observed performance delta between text and speech input—even in architectures with tightly coupled speech and language modules. Analyses using VoiceBench BBH reveal that the essential bottleneck is not merely geometric representation mismatch but the inherent redundancy and information dilution in speech input relative to tokenized text. This manifests as diffuse decision attention and reduced logit confidence in late LLM layers for speech inputs, even after extended fine-tuning (Hsu et al., 2 Mar 2026). Simple statistical calibration at the input layer can be detrimental, confirming the need for non-linear "structural transformation"—effective token condensation—in speech-to-text adaptation (Hsu et al., 2 Mar 2026).
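The notion of reduced logit confidence can be illustrated with two standard quantities computed from a next-token logit vector: the top-1 softmax probability and the entropy of the softmax distribution. This is a generic sketch and not the analysis procedure of Hsu et al.; the logit values are made up, and the function names are chosen for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_stats(logits):
    """Return (top-1 probability, entropy in nats) of the softmax distribution."""
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return max(probs), entropy


# Made-up logits: a peaked distribution and a flatter, more diffuse one.
top1_peaked, entropy_peaked = confidence_stats([5.0, 0.0, 0.0, 0.0])
top1_diffuse, entropy_diffuse = confidence_stats([1.0, 0.5, 0.5, 0.5])
print(top1_peaked > top1_diffuse, entropy_peaked < entropy_diffuse)  # prints: True True
```

Lower top-1 probability together with higher entropy is one simple signature of a less confident prediction.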

Recent advances integrating self-generated cross-modal alignment (e.g., DeSTA2.5-Audio) or instruction-free tuning (AZeroS) demonstrate that deliberate data and adapter design can substantially improve generalization on VoiceBench, with self-generated targets preserving backbone LLM style and enhancing robust spoken-response quality (Lu et al., 3 Jul 2025, Shao et al., 31 Dec 2025).

6. Limitations and Frontiers

VoiceBench’s current scope is limited to text outputs, omitting synthesis and assessment of response speech naturalness, prosody, or end-to-end S2S fidelity (Chen et al., 2024). The synthetic data, while diverse, cannot fully capture spontaneous, multi-speaker, conversational phenomena encountered in deployment (e.g., overlapping speech, uncontrolled real-world noise). No aggregation rule for the overall score is officially mandated, which leads to minor leaderboard inconsistencies. Closed-source models remain outside open leaderboards except for select reference runs (e.g., GPT-4o-Audio), and multi-turn, real-time, and cross-lingual evaluation is still nascent (Chen et al., 2024, Jain et al., 9 Oct 2025, Ray et al., 14 Mar 2026).

Recommendations include the extension of VoiceBench to: (1) evaluation of synthesized speech output and end-to-end S2S metrics; (2) increased coverage of real human recordings, code-switching, long-context dialogue, and non-English languages; (3) noisy and adversarial environments with dynamic user interaction; (4) unified pipeline vs. E2E evaluation frameworks; and (5) quantitative paralinguistic and emotional intelligence assessments (Chen et al., 2024, Jain et al., 9 Oct 2025, Liu et al., 21 May 2025, Selvakumar et al., 14 Jul 2025).

VoiceBench is now co-cited, extended, or cross-compared in a range of subsequent speech and agentic LLM evaluation efforts (Jain et al., 9 Oct 2025, Lu et al., 3 Jul 2025, Ray et al., 14 Mar 2026, Shao et al., 31 Dec 2025). Major related efforts include VocalBench (open S2S, paralinguistic and robustness focus) (Liu et al., 21 May 2025), VocalBench-zh (Mandarin S2S) (Liu et al., 11 Nov 2025), and VCB Bench (real-speech Mandarin agent control and robustness) (Hu et al., 13 Oct 2025). Omnimodal extensions such as MultiVox (Selvakumar et al., 14 Jul 2025) natively incorporate audio+visual context with explicit assessment of speech grounding, further broadening the VoiceBench paradigm.

By establishing a standard for rigorous, end-to-end speech interaction testing, VoiceBench is central to progress in robust and trustworthy conversational AI, illuminating mechanisms and bottlenecks in agent inference, training, and architecture across modalities.