
S²Bench: Benchmarking Speech LLMs

Updated 6 February 2026
  • S²Bench is a diagnostic benchmark suite that quantifies intelligence degradation in speech LLMs by comparing audio-token processing with text-based reasoning.
  • It employs two aligned tasks—sentence continuation and commonsense reasoning—to robustly measure performance differences between speech and text modalities.
  • The evaluation protocol uses pairwise perplexity metrics to reveal performance gaps, guiding improvements in audio tokenization and training strategies.

S²Bench is a diagnostic benchmark suite designed to quantify the intelligence degradation of end-to-end speech LLMs as they process audio tokens directly rather than text. The benchmark addresses the core question of how, and to what extent, speech-to-speech and speech-to-text models lose reasoning and language-generation capability compared to their text-only counterparts. S²Bench comprises diagnostic datasets targeting sentence continuation and commonsense reasoning under the audio modality, alongside a robust pairwise perplexity-based evaluation protocol. Publicly released code and datasets provide a standardized means for systematic assessment and cross-model comparison (Fang et al., 20 May 2025).

1. Motivation: Intelligence Degradation in End-to-End Speech LLMs

The development of end-to-end speech-to-speech and speech-to-text LLMs aims to eliminate the cascading errors of traditional ASR+NLP pipelines and to capture richer prosodic and speaker cues by operating directly on audio tokens. However, such models consistently underperform text-only LLMs on tasks requiring abstract reasoning and language generation. This phenomenon, termed “intelligence degradation,” has been attributed to three principal sources:

  • Limited semantic density of audio tokens: audio tokenizers yield representations with lower information content per token than text tokens.
  • Longer token sequences: Audio input sequences typically contain 5–10 times more tokens than corresponding text, increasing sequence modeling complexity.
  • High variability in audio: Factors like prosody, speaker identity, and recording conditions introduce variance not present in text input.

S²Bench explicitly investigates the nature and quantitative extent of this degradation, isolating the impact of the audio modality through a controlled suite of diagnostic tasks and evaluation metrics (Fang et al., 20 May 2025).

2. Benchmark Architecture and Diagnostic Tasks

S²Bench consists of two core diagnostic tasks, each provided in parallel text and audio versions. Audio samples are produced via synthesis or re-recording of the corresponding textual material, thereby ensuring modality alignment.

A. Sentence Continuation (“sStoryCloze”)

  • Based on the Story Cloze Test and its Chinese translation (“zh-sStoryCloze”).
  • Each instance: a 4-sentence context plus two candidate continuations (one plausible, one implausible); see the example after this list.
  • ≈1,200 instances per language.
  • Text length: 70–90 tokens; audio duration: 12–18 s.
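For concreteness, one instance might be represented as follows; the field names, story text, and file paths are hypothetical illustrations, not drawn from the released dataset.

```python
# Hypothetical sStoryCloze instance; all content here is illustrative.
instance = {
    "context": (  # 4-sentence context
        "Anna had practiced the piano every day for a month. "
        "Her recital was finally scheduled for Saturday evening. "
        "She felt nervous as she walked onto the stage. "
        "The audience grew quiet as she sat down."
    ),
    "positive": "She took a deep breath and began to play.",  # plausible continuation
    "negative": "She ordered a pizza for the audience.",      # implausible continuation
    "audio": {  # parallel synthesized/re-recorded audio, 12-18 s in total
        "context": "audio/anna_context.wav",
        "positive": "audio/anna_pos.wav",
        "negative": "audio/anna_neg.wav",
    },
}
```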

B. Commonsense Reasoning (“sCMMLU”)

  • Derived from CMMLU; each multiple-choice question is reformulated into four declarative sentences with identical first halves and diverging endings (one correct, three distractors); a conversion sketch follows this list.
  • 4,743 questions → 18,972 statements.
  • Text length: 20–30 tokens; audio: 3–5 s.
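To make the conversion concrete, here is a minimal Python sketch; the function, field names, and the English example are illustrative assumptions rather than the released data pipeline (the actual sCMMLU material is Chinese).

```python
# Illustrative sketch of the MCQ-to-declarative conversion behind sCMMLU.
def mcq_to_statements(stem: str, options: dict, answer: str) -> list:
    """Turn one multiple-choice question into four declarative statements
    sharing the stem (identical first half) and diverging only in the ending."""
    return [
        {"text": stem + ending,
         "label": "correct" if key == answer else "distractor"}
        for key, ending in options.items()
    ]

# One 4-option question yields four statements (4,743 questions -> 18,972 statements).
statements = mcq_to_statements(
    stem="The capital of France is ",
    options={"A": "Paris.", "B": "Lyon.", "C": "Marseille.", "D": "Nice."},
    answer="A",
)
```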

This alignment enables robust modality comparison without altering task semantics, ensuring that observed performance differences chiefly reflect input representation effects.

3. Evaluation Protocol: Pairwise Perplexity and Task Degradation Metrics

The primary S²Bench evaluation protocol is a pairwise perplexity comparison, executed under two conditions:

  • Text→Text (T→T): Text input, text output. Serves as the reference upper bound.
  • Speech→Text (S→T): Audio input (as tokens), text output.

For each instance $i$ and each modality $m \in \{\text{T→T},\, \text{S→T}\}$, the model computes:

  • $\operatorname{PPL}_\text{pos}(i, m)$: perplexity of the positive (plausible) continuation.
  • $\operatorname{PPL}_\text{neg}(i, m)$: perplexity of the negative (implausible) continuation.

A model is correct on instance $i$ if $\operatorname{PPL}_\text{pos}(i, m) < \operatorname{PPL}_\text{neg}(i, m)$. The fraction of correctly ranked pairs constitutes the model’s accuracy. To capture the confidence gap, S²Bench computes the mean per-instance gap:

$$\Delta \operatorname{PPL}(m) = \mathbb{E}_i \left[ \operatorname{PPL}_\text{neg}(i, m) - \operatorname{PPL}_\text{pos}(i, m) \right]$$

To isolate the performance loss induced by the audio modality, S²Bench defines:

$$\operatorname{Degradation} = \Delta \operatorname{PPL}(\text{T→T}) - \Delta \operatorname{PPL}(\text{S→T})$$

Since T→T serves as the reference upper bound, larger positive values of Degradation indicate more pronounced reasoning impairment under audio input.
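The sketch below illustrates this protocol for the T→T condition with a Hugging Face causal LM. The model name, the plain concatenation of context and continuation, and the prefix-tokenization simplification are assumptions; the released S2SBench code may differ.

```python
# Minimal sketch of the pairwise-perplexity protocol (T→T condition).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B"  # assumption: any causal LM works; 7B matches the text baseline
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def continuation_ppl(context: str, continuation: str) -> float:
    """Perplexity of `continuation` given `context`, scored on continuation tokens only."""
    # Simplification: assumes the context's tokens form a prefix of the full tokenization.
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    ids = tok(context + continuation, return_tensors="pt").input_ids
    labels = ids.clone()
    labels[:, :ctx_len] = -100                       # exclude context positions from the loss
    loss = model(input_ids=ids, labels=labels).loss  # mean NLL over continuation tokens
    return torch.exp(loss).item()

def evaluate(instances):
    """instances: iterable of (context, positive, negative) triples for one modality."""
    n_correct, gaps = 0, []
    for ctx, pos, neg in instances:
        ppl_pos = continuation_ppl(ctx, pos)
        ppl_neg = continuation_ppl(ctx, neg)
        n_correct += ppl_pos < ppl_neg   # ranked correctly if positive is less perplexing
        gaps.append(ppl_neg - ppl_pos)
    return n_correct / len(gaps), sum(gaps) / len(gaps)  # accuracy, ΔPPL(m)
```

Running `evaluate` once per modality (with audio tokens substituted for text input in the S→T condition) and differencing the two ΔPPL values yields the Degradation score defined above.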

4. Training Regimes and Case Study: Baichuan-Audio Model

The practical applicability of S²Bench was demonstrated through detailed experiments on Baichuan-Audio, an interleaved end-to-end speech LLM derived from Qwen2.5 (7B parameters). Two distinct training strategies were analyzed:

  • Single-Stage: All model parameters updated jointly.
  • Two-Stage (see the sketch after this list):
  1. Freeze the text-LM weights; train only the audio embedding and audio head.
  2. Unfreeze all parameters except the LM embedding and head to refine the full model.
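A minimal PyTorch sketch of this freezing schedule follows; the module names (`audio_embed`, `audio_head`, `lm`) are hypothetical stand-ins, since Baichuan-Audio's internal layout is not specified here.

```python
# Hedged sketch of the two-stage freezing schedule. `audio_embed`,
# `audio_head`, and `lm` are hypothetical module names, not Baichuan-Audio's
# actual attributes.
def configure_stage(model, stage: int) -> None:
    if stage == 1:
        # Stage 1: freeze the text LM; train only audio embedding and audio head.
        for p in model.parameters():
            p.requires_grad = False
        for module in (model.audio_embed, model.audio_head):
            for p in module.parameters():
                p.requires_grad = True
    else:
        # Stage 2: unfreeze everything except the LM's input embedding and
        # output head, refining the rest of the model.
        for p in model.parameters():
            p.requires_grad = True
        for module in (model.lm.get_input_embeddings(),
                       model.lm.get_output_embeddings()):
            for p in module.parameters():
                p.requires_grad = False
```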

Key findings include:

  • Both strategies reach ≈83% accuracy on the story continuation task in the T→T condition, paralleling Qwen2.5.
  • In S→T, the single-stage regime attained 77.5% (sStoryCloze), whereas two-stage improved to 79.6%.
  • Two-stage yielded faster, more stable convergence and cleaner negative/positive sample loss separation, indicating enhanced confidence calibration.

5. Comparative Results: Model Performance on S²Bench

Table 1 summarizes S²Bench accuracy (%) for leading models:

| Model | Modality | Params | sStoryCloze | zh-sStoryCloze | sCMMLU |
|---|---|---|---|---|---|
| TWIST | S→T | 7B | 53.3 | – | – |
| Moshi | S→T | 7B | 60.8 | – | – |
| GLM-4-Voice | S→T | 9B | 76.3 | 70.3* | 64.3* |
| Qwen2.5 | T→T | 7B | 83.0 | 76.1 | 70.3 |
| Baichuan-Audio (single-stage) | S→T | 7B | 77.5 | 70.1 | 67.0 |
| Baichuan-Audio (two-stage) | S→T | 7B | 79.6 | 72.4 | 69.3 |

*Evaluated on the instruct-tuned variant.

Principal observations (Fang et al., 20 May 2025):

  • Off-the-shelf, fully end-to-end S→T models are outperformed by text-based models by margins ranging from 5 to 30 percentage points.
  • Interleaved architectures (GLM-4-Voice, Baichuan-Audio) reduce, but do not close, the accuracy gap (remaining ≈3–5 points behind text models).
  • Commonsense reasoning (sCMMLU) exhibits marginally less degradation (≈1–3 points) than discourse-level coherence (sStoryCloze).

6. Methodological Innovations and Diagnostic Insights

S²Bench’s main methodological contribution is a pairwise perplexity protocol that isolates reasoning ability without requiring changes in model architecture or downstream task design. By aligning diagnostic datasets across text and speech modalities and measuring response plausibility through perplexity differences, S²Bench enables controlled, modality-specific quantification of model deficits.

Analysis reveals that current model architectures and audio tokenization pipelines limit semantic density and introduce sequence length and variability challenges that are not addressed by conventional training. Two-stage training partly mitigates, but does not eliminate, these deficits—a finding substantiated via confidence margin analysis and learning curve stability.

7. Future Directions and Open Questions

S²Bench exposes critical challenges for end-to-end speech model evaluation and motivates several avenues for future research:

  • Development of a full Speech→Speech (S→S) evaluation protocol remains out of reach due to heterogeneity in audio tokenization schemes across models.
  • Expansion of diagnostic coverage to include multi-turn dialogue, arithmetic reasoning, and code synthesis under audio input is a priority.
  • Redesign of audio tokenization strategies, such as semantic clustering or learned vocabularies to increase semantic density, is identified as a potential means of mitigating intelligence degradation.

All datasets and evaluation code are publicly accessible at https://github.com/undobug/S2SBench, providing an open platform for benchmarking and further methodological development (Fang et al., 20 May 2025).

References

  • Fang et al., 20 May 2025. Code and data: https://github.com/undobug/S2SBench.
