VCB Bench: Audio-Grounded Voice Chat Evaluation

Updated 4 July 2026

VCB Bench is an evaluation benchmark for Chinese voice conversational agents that uses real human speech and offers bilingual instruction following tasks.
It assesses systems across three dimensions: instruction following, knowledge understanding, and robustness to acoustic and linguistic variations.
The benchmark employs rigorous, multi-stage quality control and diverse real-speech sources to detail system performance and inherent limitations.

Searching arXiv for the primary benchmark paper and closely related voice-agent evaluation benchmarks. {"query":"(Hu et al., 13 Oct 2025) VCB Bench An Evaluation Benchmark for Audio-Grounded LLM Conversational Agents", "max_results": 5} Voice Chat Bot Bench (VCB Bench) is an evaluation benchmark for audio-grounded LLM conversational agents, introduced as the first comprehensive, real-speech evaluation benchmark tailored to Chinese voice conversational agents built with large audio LLMs (LALMs). It was proposed to address three limitations in prior evaluation practice: English-centric coverage, heavy dependence on synthetic TTS audio, and reuse of text-derived QA corpora whose formal and lengthy style is mismatched with voice conversation. VCB Bench therefore focuses on Chinese, uses only real human speech, and evaluates LALMs from three complementary perspectives: instruction following, knowledge understanding, and robustness (Hu et al., 13 Oct 2025).

1. Rationale and scope

VCB Bench is designed for practical voice conversational systems rather than text-first assistants with speech attached as a transport layer. Its central design claim is that evaluation must reflect the acoustic, prosodic, and paralinguistic variability of real interaction. For that reason, all audio in the benchmark is recorded by people rather than synthesized by TTS, and robustness subsets include re-recordings by the same speaker so that interference conditions can be varied while speaker identity is controlled (Hu et al., 13 Oct 2025).

The benchmark is primarily Chinese, but it is not monolingual in a narrow sense. It provides bilingual Chinese-English coverage for both text instruction following and speech instruction following, and it includes dialectal and accent-sensitive material. The speech instruction following module includes language control items such as dialect switching, including Shanghai dialect, while the robustness design includes Tianjin, Beijing, Dongbei, and Sichuan accents. Code-switching is also explicitly represented through a content-variation category that mixes Chinese and English. This combination makes VCB Bench both Chinese-centered and bilingual in selected submodules rather than a generic multilingual benchmark (Hu et al., 13 Oct 2025).

The benchmark’s scope is organized around three dimensions that correspond to different operational requirements of a deployed voice agent. Instruction following examines compliance with text and speech controls, including paralinguistic constraints and multi-turn dialogue. Knowledge understanding measures general knowledge, mathematical and logical reasoning, discourse comprehension, and story continuation. Robustness measures stability under perturbations in content, environment, and speaker traits. The paper presents these as complementary rather than interchangeable dimensions (Hu et al., 13 Oct 2025).

2. Corpus construction and quality control

VCB Bench integrates three real-speech sources, each aligned to specific submodules. The first source is third-party professional recording, which supports Instruction Following, Mathematical & Logical Reasoning, Story Continuation, and Robustness. Its pipeline begins with task definition and examples through team discussion, followed by manual composition of task texts by commissioned personnel, manual inspection of those texts, professional audio recording of approved texts by a third-party team, post-recording quality checks by the data team, GPT-4o-audio evaluation of audio quality, GPT-4o evaluation of textual quality, and final manual screening for high-quality samples. Robustness construction is layered on top of this source: the original instruction-following audio serves as the control group, the same speaker re-records under specified interference such as accent or noisy environment, content variation subsets alter the text and then re-record it by the same speaker, and some extreme-condition subsets such as Volume, Speed, and Unstable Signal undergo post-processing (Hu et al., 13 Oct 2025).

The second source is variety show question-answer data, which supports the General Knowledge module. The pipeline begins with crawling about 20 hours of Q&A audio, then manual annotation and timestamp-based segmentation, ASR transcription, GPT-4o scoring of question clarity and answer accuracy, subject-domain classification, threshold-based selection, and final manual review of transcription accuracy, answer correctness, and audio clarity. The third source is internal two-person dialogue data for Discourse Comprehension. Long-form audio is segmented in two stages with GPT-4O—first by topic, then into semantically coherent segments under 1 minute—and GPT-4O generates task-specific QA pairs for analysis, induction, and inference, followed by manual screening for question quality and answer accuracy (Hu et al., 13 Oct 2025).

Quality control is multi-stage and combines manual inspection with model-based scoring. GPT-4o-audio is used for audio quality and GPT-4o for textual quality, followed by manual selection. At the same time, the paper explicitly notes several omissions: it does not report annotator agreement metrics such as Cohen’s $\kappa$ , it does not describe consent or anonymization procedures, and it does not specify sampling rate, bit depth, SNR ranges, or train/dev/test splits. These omissions are part of the benchmark’s documented limitations rather than unstated background assumptions (Hu et al., 13 Oct 2025).

3. Task structure and benchmark composition

VCB Bench is fine-grained both in dimension design and in subtask cardinality. The benchmark includes bilingual instruction-following tasks, knowledge subsets across 12 disciplines, multi-type reasoning, multi-turn dialogue, story continuation, and robustness perturbations. Representative subtask sizes given by the paper are summarized below.

Dimension	Subtasks	Representative sizes
Instruction following	Chinese TIF, English TIF-En, Chinese SIF, English SIF-En, MTD	Chinese TIF: Continuation 200, Creation 200, Empathy 200, Recommendation 200, Rewriting 165, Safety 200, Simulation 200; MTD: Progression 80, Backtracking 80, Transition 80
Knowledge understanding	GK, ML, DC, SC	GK includes Mathematics 36, Geography 150, Physics 102, Culture 150; ML includes Basic Math 146, Medium Math 170, Logic 159; DC includes Analysis 115, Induction 113, Inference 103
Robustness	SV, EV, CV	CV includes Fillers 100, Repetition 87, Mispronunciation 89, Grammatical Error 69, Topic Shift 91, Code Switching 92

Instruction following is divided into text instruction following and speech instruction following. Text instruction following, in both Chinese and English, covers continuation, creation, empathy, recommendation, rewriting, safety, and simulation or role-play. Speech instruction following tests audio generation under explicit speech-level controls: Emotional Control, Language Control, Non-verbal Vocalization, Pacing Control, Style Control, and Volume Control. These tasks extend beyond text-conditioned compliance by requiring the output speech itself to realize specified paralinguistic attributes such as loudness, speaking rate, emotion, or dialect choice (Hu et al., 13 Oct 2025).

Multi-turn Dialogue (MTD) is structured as 3–5 turn interactions with three categories: Progression, Backtracking, and Transition. The scoring emphasizes the final turn, which accounts for 50% of each sample’s score. This gives the module a dialogue-management orientation rather than a purely local-turn one, since the later turn is weighted most heavily (Hu et al., 13 Oct 2025).

Knowledge understanding is subdivided into four modules. General Knowledge is reference-based QA across 12 disciplines: mathematics, geography, politics, chemistry, biology, law, physics, history, medicine, economics, sports, and culture. Mathematical & Logical Reasoning covers Basic Math, Medium Math, Analysis, Induction, Analogy, and Logic. Discourse Comprehension evaluates factual analysis, theme induction, and inference over dialogue segments. Story Continuation, inspired by StoryCloze, is evaluated in both speech-to-text and speech-to-speech modes across Logic & Causality, Common Sense & Science, and Morality & Emotion (Hu et al., 13 Oct 2025).

Robustness is also tripartite. Speaker Variations include age conditions such as child and elder speech, accent conditions, volume changes, and fast speech. Environmental Variations include non-vocal noise such as echo, outdoors, and far-field, vocal noise such as TV playback, background chat, vocal-music, and voice announcement, and Unstable Signal via packet-loss simulation. Content Variations include fillers, repetition, mispronunciation, grammatical errors, topic shift, and code-switching. The taxonomy is therefore broader than a simple noise-stress test; it separates physical, speaker-specific, and linguistic perturbations (Hu et al., 13 Oct 2025).

4. Evaluation protocol and metrics

The benchmark uses a standardized protocol but does not reduce performance to a single weighted composite. For Chinese and English speech instruction following, models’ audio responses are scored automatically by GPT-4o-audio for adherence to speech-level controls. For other tasks, except Story Continuation, models generate text through audio2text APIs and GPT-4O serves as judge. Open-ended QA is scored numerically on a 1–5 scale, while reference-based QA uses binary Yes/No judgments. Story Continuation is evaluated through negative log-likelihood comparison between two candidate endings rather than by an explicit generative-quality rubric. The paper states that no explicit formula is printed for the NLL decision rule (Hu et al., 13 Oct 2025).

The MTD protocol follows Bai et al. (2024): the model answers each turn using the original ground-truth context rather than its own prior responses, and the final turn contributes 50% of the sample score. For subjective evaluation, the benchmark collects Mean Opinion Score (MOS) from eight expert raters on 30 sampled items per relevant dataset for top speech-instruction-following models, specifically to compare human judgment with automatic evaluation. Aggregation is reported as averaged scores per subtask and per dimension. The paper does not provide weighted-sum formulas such as $S=\sum w_i s_i$ , and it does not report statistical significance tests or $p$ -values (Hu et al., 13 Oct 2025).

The protocol also includes a text-speech alignment ablation that compares audio-to-text direct evaluation, audio-to-audio evaluation with ASR, and audio-to-audio evaluation without ASR. This is not a headline metric in the main benchmark tables, but it is an important diagnostic. It is used to analyze semantic alignment between generated text and generated speech, and to separate alignment errors from audio clarity errors across models (Hu et al., 13 Oct 2025).

5. Empirical results and diagnostic findings

The evaluated models include GLM4-Voice, Kimi-Audio, Qwen2.5-Omni, Baichuan-Audio-Chat, Qwen2-Audio-Instruct, StepAudio, StepAudio2Mini, Mimo-Audio, and GPT4o-Audio, with several pretrained base variants evaluated separately for Story Continuation. The main results show a stratified capability profile rather than a universal winner. In instruction following, GPT4o-Audio attains 91.24 on Chinese TIF, 91.66 on English TIF-En, 88.15 on Chinese SIF, and 86.07 on English SIF-En; MTD is unavailable for that model because of API constraints. Among open-source systems, StepAudio2Mini reaches 87.80 on MTD and is reported as the best open-source model on that module. In knowledge understanding, GPT4o-Audio leads General Knowledge with 61.29, while Mimo-Audio leads Mathematical & Logical Reasoning with 84.01 and Discourse Comprehension with 87.92. These results expose asymmetric capability profiles: Mimo-Audio is strong on ML and DC but weak on English SIF-En at 24.25, whereas StepAudio2Mini is comparatively strong on multi-turn dialogue (Hu et al., 13 Oct 2025).

Granular results sharpen that picture. In Chinese SIF objective evaluation, GPT4o-Audio averages 88.15 and leads Emotional Control at 92.40 and Language Control at 87.80, while Kimi-Audio reaches 95.00 on Style Control. In English SIF objective evaluation, GPT4o-Audio again leads most sub-dimensions with an overall score of 86.07, and several open-source models drop sharply, including Mimo-Audio at 24.25. In General Knowledge, GPT4o-Audio shows breadth across domains, including Economics at 85.42 and Geography at 62.00, while StepAudio2Mini is strong in Chemistry at 80.43 and Physics at 74.51. In ML, Kimi-Audio leads Basic Math at 98.63, but Mimo-Audio is best overall, especially on Logic and Analogy. In DC, Mimo-Audio stands out in Inference at 95.15 and Induction at 88.50 (Hu et al., 13 Oct 2025).

Story Continuation is evaluated separately on pretrained base variants. Kimi-Audio-Base leads both speech-to-text and speech-to-speech modes, with average scores of 78.01 and 54.71 respectively. All models perform worse in speech-to-speech than in speech-to-text. This suggests that maintaining semantic coherence directly in generated speech remains more difficult than selecting or scoring through a textual interface (Hu et al., 13 Oct 2025).

Robustness results identify physical perturbations as the most severe source of degradation. Large drops are reported for fast speech, echo, and elder speech. For example, under Speed, GLM4-Voice falls to 39.20 with a drop of 55.20, GPT4o-Audio to 43.40 with a drop of 48.40, and Mimo-Audio to 60.40 with a drop of 33.00. Under Echo, GLM4-Voice reaches 45.00 with a drop of 36.80, Qwen2.5-Omni reaches 52.00 with a drop of 30.60, and Mimo-Audio reaches 54.00 with a drop of 35.60. By contrast, content-level perturbations are often milder for stronger models: GPT4o-Audio scores 88.40 on Fillers with a positive difference of 1.80, and Mimo-Audio scores 96.00 on Mispronunciation with a drop of only 3.60. Environmental noise is heterogeneous; outdoors and vocal-music are comparatively benign for stronger models, whereas echo and unstable signal remain disruptive (Hu et al., 13 Oct 2025).

The alignment ablation further separates semantic mismatch from audio clarity. GLM4-Voice in Chinese and Baichuan-Audio-Chat in English show close agreement between audio-to-text and audio-to-audio with ASR, indicating consistent semantics between text and generated speech. Qwen2.5-Omni in Chinese and Kimi-Audio in English show large gaps, suggesting semantic differences between text and speech generation. Small gaps between audio-to-audio with ASR and audio-to-audio without ASR indicate clearer audio; GLM4-Voice and Kimi-Audio in Chinese, and Baichuan-Audio-Chat in English, are reported as relatively clear by this criterion, while Kimi-Audio in English exhibits large ASR-induced drops. Subjective and objective speech-instruction-following evaluation also diverge for some models: GPT4o-Audio and GLM4-Voice have relatively small discrepancies between MOS and GPT-4o-audio scores, whereas Kimi-Audio shows larger gaps in some sub-dimensions, especially Language (Hu et al., 13 Oct 2025).

6. Reproducibility, limitations, and benchmark context

The authors release code and data at https://github.com/193746/VCB-Bench-Evalkit, including evaluation scripts and benchmark datasets spanning the three dimensions and their subtasks. The appendix provides exemplar items across tasks, which clarifies formats and expectations. The paper also gives concrete replication guidance: use the released Evalkit for standardized audio2audio and audio2text evaluations; score SIF with GPT-4o-audio; score open-ended text tasks with GPT-4O on a 1–5 scale; use binary Yes/No judgments for reference-based QA; apply the ground-truth-context MTD protocol with a 50% final-turn weight; compute NLL comparisons for Story Continuation; and use eight expert raters with 30 sampled items per dataset for subjective audio quality where applicable (Hu et al., 13 Oct 2025).

The paper records several limitations. Rapid model evolution means benchmark coverage must be continuously updated. English coverage is not universal across all subsets. Prompting strategies may not fully exploit model capabilities. The article also notes metadata and annotation gaps, including the absence of annotator agreement statistics, detailed annotation guidelines, judge prompts, licensing information, and fine-grained audio specifications such as sampling rate, bit depth, and SNR ranges. A plausible implication is that VCB Bench is strong as an evaluation scaffold and weaker as a fully specified dataset card in the documentation sense (Hu et al., 13 Oct 2025).

Within the wider benchmark landscape, VCB Bench occupies a distinctive niche. WavBench introduces a tripartite framework centered on Pro reasoning, Basic colloquial “listenability,” and Acoustic interaction for end-to-end spoken dialogue models, but its corpus is largely synthesized rather than entirely real-speech (Li et al., 12 Feb 2026). EVA-Bench targets enterprise voice agents through live bot-to-bot audio conversations, simulator validation, composite EVA-A and EVA-X metrics, and pass@1, pass@k, and pass $^k$ reporting across cascade, hybrid, and speech-to-speech architectures (Bogavelli et al., 13 May 2026). VoiceAgentBench focuses on speech-based agentic tool use with over 5,500 synthetic spoken queries across English, Hindi, and five other Indian languages, emphasizing tool selection, structural consistency, parameter filling, and safety (Jain et al., 9 Oct 2025). VocalBench-zh provides a Mandarin-adapted speech-to-speech suite with 10 subsets and 11,115 instances, extending evaluation to creativity, empathy, code-switching, safety, and robustness in a broader Mandarin setting (Liu et al., 11 Nov 2025). WildSpeech-Bench concentrates on English single-turn natural speech conversation, human-recorded paralinguistic phenomena, overlapping speech, and query-aware evaluation (Zhang et al., 27 Jun 2025). Against these benchmarks, VCB Bench is specifically defined by its Chinese orientation, entirely human-recorded audio, and three-way decomposition into instruction following, knowledge understanding, and robustness.