
S²Bench: Benchmarking Speech LLMs

Updated 6 February 2026
  • S²Bench is a diagnostic benchmark suite that quantifies intelligence degradation in speech LLMs by comparing audio-token processing with text-based reasoning.
  • It employs two aligned tasks—sentence continuation and commonsense reasoning—to robustly measure performance differences between speech and text modalities.
  • The evaluation protocol uses pairwise perplexity metrics to reveal performance gaps, guiding improvements in audio tokenization and training strategies.

S²Bench is a diagnostic benchmark suite designed to quantify the intelligence degradation of end-to-end speech LLMs as they process audio tokens directly rather than text. The benchmark addresses the core question of how, and to what extent, speech-to-speech and speech-to-text models lose reasoning and language-generation capability compared to their text-only counterparts. S²Bench comprises diagnostic datasets targeting sentence continuation and commonsense reasoning under the audio modality, alongside a robust pairwise perplexity-based evaluation protocol. Publicly released code and datasets provide a standardized means for systematic assessment and cross-model comparison (Fang et al., 20 May 2025).

1. Motivation: Intelligence Degradation in End-to-End Speech LLMs

The development of end-to-end speech-to-speech and speech-to-text LLMs aims to eliminate the cascading errors of traditional ASR+NLP pipelines and to capture richer prosodic and speaker cues by operating directly on audio tokens. However, such models consistently underperform text-only LLMs on tasks requiring abstract reasoning and language generation. This phenomenon, termed “intelligence degradation,” has been attributed to three principal sources:

  • Limited semantic density of audio tokens: audio tokenizers yield representations with lower information content per token than text tokens.
  • Longer token sequences: Audio input sequences typically contain 5–10 times more tokens than corresponding text, increasing sequence modeling complexity.
  • High variability in audio: Factors like prosody, speaker identity, and recording conditions introduce variance not present in text input.

S²Bench explicitly investigates the nature and quantitative extent of this degradation, isolating the impact of the audio modality through a controlled suite of diagnostic tasks and evaluation metrics (Fang et al., 20 May 2025).

2. Benchmark Architecture and Diagnostic Tasks

S²Bench consists of two core diagnostic tasks, each provided in parallel text and audio versions. Audio samples are produced via synthesis or re-recording of the corresponding textual material, thereby ensuring modality alignment.

A. Sentence Continuation (“sStoryCloze”)

  • Based on the Story Cloze Test and its Chinese translation (“zh-sStoryCloze”).
  • Each instance: a 4-sentence context plus two candidate continuations (one plausible, one implausible); see the example after this list.
  • ≈1,200 instances per language.
  • Text length: 70–90 tokens; audio duration: 12–18 s.
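For concreteness, one instance might be represented as follows; the field names, story text, and file paths are hypothetical illustrations, not drawn from the released dataset.

```python
# Hypothetical sStoryCloze instance; all content here is illustrative.
instance = {
    "context": (  # 4-sentence context
        "Anna had practiced the piano every day for a month. "
        "Her recital was finally scheduled for Saturday evening. "
        "She felt nervous as she walked onto the stage. "
        "The audience grew quiet as she sat down."
    ),
    "positive": "She took a deep breath and began to play.",  # plausible continuation
    "negative": "She ordered a pizza for the audience.",      # implausible continuation
    "audio": {  # parallel synthesized/re-recorded audio, 12-18 s in total
        "context": "audio/anna_context.wav",
        "positive": "audio/anna_pos.wav",
        "negative": "audio/anna_neg.wav",
    },
}
```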

B. Commonsense Reasoning (“sCMMLU”)

  • Derived from CMMLU; each multiple-choice question is reformulated into four declarative sentences with identical first halves and diverging endings (one correct, three distractors); a conversion sketch follows this list.
  • 4,743 questions → 18,972 statements.
  • Text length: 20–30 tokens; audio: 3–5 s.
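To make the conversion concrete, here is a minimal Python sketch; the function, field names, and the English example are illustrative assumptions rather than the released data pipeline (the actual sCMMLU material is Chinese).

```python
# Illustrative sketch of the MCQ-to-declarative conversion behind sCMMLU.
def mcq_to_statements(stem: str, options: dict, answer: str) -> list:
    """Turn one multiple-choice question into four declarative statements
    sharing the stem (identical first half) and diverging only in the ending."""
    return [
        {"text": stem + ending,
         "label": "correct" if key == answer else "distractor"}
        for key, ending in options.items()
    ]

# One 4-option question yields four statements (4,743 questions -> 18,972 statements).
statements = mcq_to_statements(
    stem="The capital of France is ",
    options={"A": "Paris.", "B": "Lyon.", "C": "Marseille.", "D": "Nice."},
    answer="A",
)
```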

This alignment enables robust modality comparison without altering task semantics, ensuring that observed performance differences chiefly reflect input representation effects.

3. Evaluation Protocol: Pairwise Perplexity and Task Degradation Metrics

The primary S²Bench evaluation protocol is a pairwise perplexity comparison, executed under two conditions:

  • Text→Text (T→T): Text input, text output. Serves as the reference upper bound.
  • Speech→Text (S→T): Audio input (as tokens), text output.

For each instance $i$ and each modality $m \in \{\text{T→T},\, \text{S→T}\}$, the model computes:

  • $\operatorname{PPL}_\text{pos}(i, m)$: perplexity of the positive (plausible) continuation.
  • $\operatorname{PPL}_\text{neg}(i, m)$: perplexity of the negative (implausible) continuation.

A model is correct on instance $i$ if $\operatorname{PPL}_\text{pos}(i, m) < \operatorname{PPL}_\text{neg}(i, m)$. The fraction of correctly ranked pairs constitutes the model’s accuracy. To capture the confidence gap, S²Bench computes the mean per-instance gap:

$$\Delta \operatorname{PPL}(m) = \mathbb{E}_i \left[ \operatorname{PPL}_\text{neg}(i, m) - \operatorname{PPL}_\text{pos}(i, m) \right]$$

To isolate the performance loss induced by the audio modality, S²Bench defines:

$$\operatorname{Degradation} = \Delta \operatorname{PPL}(\text{T→T}) - \Delta \operatorname{PPL}(\text{S→T})$$

Since T→T serves as the reference upper bound, larger positive values of Degradation indicate more pronounced reasoning impairment under audio input.
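The sketch below illustrates this protocol for the T→T condition with a Hugging Face causal LM. The model name, the plain concatenation of context and continuation, and the prefix-tokenization simplification are assumptions; the released S2SBench code may differ.

```python
# Minimal sketch of the pairwise-perplexity protocol (T→T condition).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B"  # assumption: any causal LM works; 7B matches the text baseline
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def continuation_ppl(context: str, continuation: str) -> float:
    """Perplexity of `continuation` given `context`, scored on continuation tokens only."""
    # Simplification: assumes the context's tokens form a prefix of the full tokenization.
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    ids = tok(context + continuation, return_tensors="pt").input_ids
    labels = ids.clone()
    labels[:, :ctx_len] = -100                       # exclude context positions from the loss
    loss = model(input_ids=ids, labels=labels).loss  # mean NLL over continuation tokens
    return torch.exp(loss).item()

def evaluate(instances):
    """instances: iterable of (context, positive, negative) triples for one modality."""
    n_correct, gaps = 0, []
    for ctx, pos, neg in instances:
        ppl_pos = continuation_ppl(ctx, pos)
        ppl_neg = continuation_ppl(ctx, neg)
        n_correct += ppl_pos < ppl_neg   # ranked correctly if positive is less perplexing
        gaps.append(ppl_neg - ppl_pos)
    return n_correct / len(gaps), sum(gaps) / len(gaps)  # accuracy, ΔPPL(m)
```

Running `evaluate` once per modality (with audio tokens substituted for text input in the S→T condition) and differencing the two ΔPPL values yields the Degradation score defined above.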

4. Training Regimes and Case Study: Baichuan-Audio Model

The practical applicability of S²Bench was demonstrated through detailed experiments on Baichuan-Audio, an interleaved end-to-end speech LLM derived from Qwen2.5 (7B parameters). Two distinct training strategies were analyzed:

  • Single-Stage: All model parameters updated jointly.
  • Two-Stage (see the sketch after this list):
  1. Freeze the text-LM weights; train only the audio embedding and audio head.
  2. Unfreeze all parameters except the LM embedding and head to refine the full model.
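A minimal PyTorch sketch of this freezing schedule follows; the module names (`audio_embed`, `audio_head`, `lm`) are hypothetical stand-ins, since Baichuan-Audio's internal layout is not specified here.

```python
# Hedged sketch of the two-stage freezing schedule. `audio_embed`,
# `audio_head`, and `lm` are hypothetical module names, not Baichuan-Audio's
# actual attributes.
def configure_stage(model, stage: int) -> None:
    if stage == 1:
        # Stage 1: freeze the text LM; train only audio embedding and audio head.
        for p in model.parameters():
            p.requires_grad = False
        for module in (model.audio_embed, model.audio_head):
            for p in module.parameters():
                p.requires_grad = True
    else:
        # Stage 2: unfreeze everything except the LM's input embedding and
        # output head, refining the rest of the model.
        for p in model.parameters():
            p.requires_grad = True
        for module in (model.lm.get_input_embeddings(),
                       model.lm.get_output_embeddings()):
            for p in module.parameters():
                p.requires_grad = False
```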

Key findings include:

  • Both strategies reach ≈83% accuracy on the story continuation task in the T→T condition, paralleling Qwen2.5.
  • In S→T, the single-stage regime attained 77.5% (sStoryCloze), whereas two-stage improved to 79.6%.
  • Two-stage yielded faster, more stable convergence and cleaner negative/positive sample loss separation, indicating enhanced confidence calibration.

5. Comparative Results: Model Performance on S²Bench

Table 1 summarizes S²Bench accuracy (%) for leading models:

| Model | Modality | Params | sStoryCloze | zh-sStoryCloze | sCMMLU |
|---|---|---|---|---|---|
| TWIST | S→T | 7B | 53.3 | – | – |
| Moshi | S→T | 7B | 60.8 | – | – |
| GLM-4-Voice | S→T | 9B | 76.3 | 70.3* | 64.3* |
| Qwen2.5 | T→T | 7B | 83.0 | 76.1 | 70.3 |
| Baichuan-Audio (single-stage) | S→T | 7B | 77.5 | 70.1 | 67.0 |
| Baichuan-Audio (two-stage) | S→T | 7B | 79.6 | 72.4 | 69.3 |

*Evaluated on the instruct-tuned variant.

Principal observations (Fang et al., 20 May 2025):

  • Off-the-shelf, fully end-to-end S→T models are outperformed by text-based models by margins ranging from 5 to 30 percentage points.
  • Interleaved architectures (GLM-4-Voice, Baichuan-Audio) reduce, but do not close, the accuracy gap (remaining ≈3–5 points behind text models).
  • Commonsense reasoning (sCMMLU) exhibits marginally less degradation (≈1–3 points) than discourse-level coherence (sStoryCloze).

6. Methodological Innovations and Diagnostic Insights

S²Bench’s main methodological contribution is a pairwise perplexity protocol that isolates reasoning ability without requiring changes in model architecture or downstream task design. By aligning diagnostic datasets across text and speech modalities and measuring response plausibility through perplexity differences, S²Bench enables controlled, modality-specific quantification of model deficits.

Analysis reveals that current model architectures and audio tokenization pipelines limit semantic density and introduce sequence length and variability challenges that are not addressed by conventional training. Two-stage training partly mitigates, but does not eliminate, these deficits—a finding substantiated via confidence margin analysis and learning curve stability.

7. Future Directions and Open Questions

S²Bench exposes critical challenges for end-to-end speech model evaluation and motivates several avenues for future research:

  • Development of a full Speech→Speech (S→S) evaluation protocol remains out of reach due to heterogeneity in audio tokenization schemes across models.
  • Expansion of diagnostic coverage to include multi-turn dialogue, arithmetic reasoning, and code synthesis under audio input is a priority.
  • Redesign of audio tokenization strategies, such as semantic clustering or learned vocabularies to increase semantic density, is identified as a potential means of mitigating intelligence degradation.

All datasets and evaluation code are publicly accessible at https://github.com/undobug/S2SBench, providing an open platform for benchmarking and further methodological development (Fang et al., 20 May 2025).

References

  • Fang et al., 20 May 2025. Code and data: https://github.com/undobug/S2SBench.
