
SAGI Benchmark for Speech LLMs

Updated 16 May 2026
  • SAGI Benchmark is a comprehensive evaluation suite designed to measure speech LLM capabilities across five levels, from basic ASR to specialized tasks.
  • It standardizes tasks, datasets, I/O formats, and formal metrics to diagnose limitations and drive improvements in speech intelligence models.
  • The benchmark’s multi-level roadmap guides innovations in data diversity, model integration, and instruction tuning aimed at superhuman speech understanding.

The SAGI Benchmark is a comprehensive evaluation suite introduced to standardize and catalyze progress toward superhuman speech understanding using LLMs. Developed in conjunction with a proposed five-level roadmap, SAGI structures evaluation across a spectrum of speech comprehension competencies, from basic automatic speech recognition (ASR) to the capacity for generalist, creative, and specialist tasks that require integration of non-semantic and abstract acoustic knowledge. The benchmark systematizes tasks, datasets, inputs, outputs, and formal metrics to diagnose capabilities and limitations in current state-of-the-art speech LLMs (Bu et al., 2024).

1. Five-Level SAGI Roadmap: Hierarchy of Speech Understanding

SAGI operationalizes a five-level hierarchical framework representing escalating complexity and generality in speech LLM capabilities:

  1. Level S (Semantic Recognition; L1, Basic ASR): Raw speech-to-text mapping, establishing a semantic foundation equivalent to composing ASR and LLM in cascade.
  2. Level A (Acoustic-Feature Perception; L2, Paralinguistic Perception): Low-level acoustic cue detection, such as loudness, pitch, rhythm, and spatialization (binaural location), testing the model’s capability to process physical audio properties beyond lexical content.
  3. Level G (General Non-Semantic Comprehension; L3): Interpretation of higher-order non-textual information—including speaker identity (age, gender), affect (emotions), ambient environment, sarcasm, and singing.
  4. Level I (Integration of Abstract Acoustic Knowledge; L4, Speech Specialist): Application of domain-specialist knowledge to acoustic analysis for medical inference (e.g., COVID-19 risk and cough type), pronunciation grading, or music understanding, necessitating abstract reasoning on audio input.
  5. Level AGI (Speech Artificial General Intelligence; L5, Generalist): Unification of semantic, paralinguistic, and domain-expert competencies in novel or creative tasks (e.g., spoken-English coaching, “voice detective”), reflecting aspirations of superhuman, generalist speech comprehension.

This tiered stratification surfaces system bottlenecks and underlines upper-bound targets for speech LLM development (Bu et al., 2024).

2. Standardized Task Suite, Datasets, and I/O Formats

The SAGI Benchmark provides a rigorously organized task suite at each level, employing canonical and custom datasets. Audio is consistently downsampled to 16 kHz mono, with utterances no longer than 30 seconds.

| Level | Task Examples | Dataset(s) / Task Type |
|---|---|---|
| S | Language ID, general ASR, legal/medical ASR, lyrics | Europarl-ST, LibriSpeech, custom Chinese, JamendoLyrics |
| A | Volume perception, pitch, binaural location | LJSpeech, SpeechAccentArchive, custom stereo |
| G | Ambient noise, acoustic scene, speaker age/gender, emotion, singing | NoisySpeech, MS-SNSD, VCTK, RAVDESS |
| I | COVID-19 risk, cough type/origin/severity | Virufy, COUGHVID |
| AGI | Spoken-English coaching, voice detective | speechocean762, SpeechAccentArchive |

Each benchmarked task specifies a question–answer input/output format. For instance, Level S language ID is posed as "What language is spoken?" (five-way choice), whereas Level I COVID-19 risk asks for binary classification based on audio symptoms (Bu et al., 2024).
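As an illustration of the question–answer format, the sketch below models one benchmark item and renders the text side of its prompt. The `TaskItem` schema, the field names, the option lettering, and the five candidate languages are all hypothetical; the source states only that language ID is a five-way choice posed as "What language is spoken?".

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TaskItem:
    """One benchmark sample: an audio clip plus a text question (hypothetical schema)."""
    level: str                      # roadmap level label, e.g. "S" or "I"
    task: str                       # task name, e.g. "language_id"
    audio_path: str                 # path to a 16 kHz mono clip (placeholder)
    question: str                   # question posed to the model
    choices: Sequence[str] = ()     # empty for open-ended tasks
    answer: Optional[str] = None    # gold label, when the task has one


def render_prompt(item: TaskItem) -> str:
    """Render the text part of the prompt; the audio is supplied separately."""
    if not item.choices:
        return item.question
    letters = "ABCDEFGHIJ"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(item.choices))
    return f"{item.question}\n{options}\nAnswer with one option letter."


# Placeholder language list: the actual five options are not given in the text.
lang_id = TaskItem(
    level="S",
    task="language_id",
    audio_path="clip_0001.wav",
    question="What language is spoken?",
    choices=("English", "German", "French", "Spanish", "Italian"),
    answer="German",
)
print(render_prompt(lang_id))
```

Keeping the gold answer next to the choices makes exact-match scoring straightforward, since the expected string or option letter is known per item.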

3. Formal Evaluation Metrics

SAGI prescribes four principal metrics, standardized for comparability and reliability:

  • Word Error Rate (for ASR tasks):

$$\text{WER} = \frac{S + D + I}{N}$$

where $S$ = substitutions, $D$ = deletions, $I$ = insertions, and $N$ = number of reference words. Samples with $\mathrm{WER} > 1$ are marked "failed." Both mean WER (over valid completions) and ASR completion rate are reported.

  • Classification Accuracy (for multiple-choice tasks):

$$\text{Accuracy} = \frac{\text{\# correct predictions}}{\text{\# total samples}}$$

Only exact matches are credited as correct.

  • Term-Insertion Accuracy (for legal/medical ASR): Success is assigned if a required term is present anywhere in the model output transcript.
  • GPT-4o Scoring (for subjective responses, e.g., Spoken-English Coach or Emotion Translation): Outputs receive scores from 0 to 4 devised by GPT-4o under a fixed rubric, quantifying qualitative performance such as emotion preservation in translation (Bu et al., 2024).
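The three automatic metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's reference implementation: the text normalization (lowercasing, whitespace tokenization) and the case-insensitive term match are assumptions, and all function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance, i.e. the minimum S + D + I."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion of a reference word
                            curr[j - 1] + 1,       # insertion of a hypothesis word
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N, with assumed lowercase/whitespace normalization."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def summarize_asr(pairs: Sequence[Tuple[str, str]]) -> Tuple[Optional[float], float]:
    """Return (mean WER over valid samples, completion rate); WER > 1 counts as failed."""
    if not pairs:
        raise ValueError("need at least one (reference, hypothesis) pair")
    scores: List[float] = [wer(r, h) for r, h in pairs]
    valid = [s for s in scores if s <= 1.0]
    completion_rate = len(valid) / len(scores)
    mean_wer = sum(valid) / len(valid) if valid else None
    return mean_wer, completion_rate


def exact_match_accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of predictions that equal the gold label exactly."""
    if len(preds) != len(golds) or not golds:
        raise ValueError("preds and golds must be non-empty and equally long")
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def term_insertion_success(transcript: str, term: str) -> bool:
    """True if the required term appears anywhere in the transcript (case-insensitive)."""
    return term.lower() in transcript.lower()
```

Reporting the completion rate alongside mean WER matters because a model that truncates or runs away on many samples could otherwise look strong on the few it completes.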

4. Quantification of Paralinguistic and Abstract Knowledge

The benchmark explicitly measures non-semantic and paralinguistic performance:

  • Level A tasks: Binary accuracies on direct perception (volume, pitch, binaural cues).
  • Level G tasks: Multi-way classification accuracies for emotion, speaker characteristics, environmental sounds.
  • Level I tasks: Success rates for medical/professional abstract inference (COVID-19 risk prediction, cough diagnostics).

Performance on these axes reveals deficits in both perception and abstraction, particularly when models are required to process signal properties orthogonal to semantic transcription (Bu et al., 2024).

5. Experimental Protocol and Baseline Systems

The evaluation protocol prescribes:

  • Canonical splits and sample sizes per task (e.g., LibriSpeech test sets for ASR, 80 samples per label for classification).
  • Human benchmarks: Four native or near-native English speakers establish human performance, with consistency of ≥85% on an evaluated subset for most tasks.
  • Speech-LLM baselines: Evaluated models include GPT-4o (API, advanced speech mode), Qwen2-Audio (7B), Mu-LLaMA (7B), GAMA, and SALMONN, all executed on NVIDIA A800 GPUs.
  • Input robustness: Prompts formatted as text and, in a subset, synthesized speech instructions (using Google TTS) to assess sensitivity to instruction modality and paraphrasing (Bu et al., 2024).
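The "80 samples per label" rule for classification tasks amounts to a label-balanced draw. The sketch below shows one way to do that reproducibly. The function name, the `(sample_id, label)` input shape, and the fixed seed are assumptions for illustration; the source does not describe how the samples were selected.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def balanced_sample(
    items: Sequence[Tuple[str, Hashable]],
    per_label: int = 80,
    seed: int = 0,
) -> Dict[Hashable, List[str]]:
    """Draw `per_label` sample ids for every label, without replacement."""
    by_label: Dict[Hashable, List[str]] = defaultdict(list)
    for sample_id, label in items:
        by_label[label].append(sample_id)

    rng = random.Random(seed)  # local generator keeps the draw reproducible
    subset: Dict[Hashable, List[str]] = {}
    for label in sorted(by_label, key=str):  # fixed label order for determinism
        pool = by_label[label]
        if len(pool) < per_label:
            raise ValueError(f"label {label!r} has only {len(pool)} samples")
        subset[label] = rng.sample(pool, per_label)
    return subset
```

With balanced labels, plain accuracy is directly comparable to the chance rate of one over the number of classes, which makes per-task results easier to interpret.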

6. Identified Gaps and Failure Modes

Empirical benchmarking uncovers significant gaps between human and model performance, particularly for paralinguistic and abstract-acoustic tasks:

  • Human vs. Machine: Humans reach ∼100% on volume/binaural, ∼90% on environment/scene, ∼50% on speaker age, and ∼60% on a cappella emotion. On specialist medical tasks, human accuracy drops to ∼30–60%. For L5 tasks (Spoken-English Coach, Voice Detective), humans achieve GPT-4o rubric scores of 1.2–1.39/4.
  • LLM Deficiencies: Speech LLMs underperform substantially at:
    • Level A: 29–53% on pitch/volume/binaural (vs. 96–100% human)
    • Level G: 16–27% on environment, 13–38% on speaker age (vs. ∼90% and 53% human)
    • Level I: 14–50% on COVID-19 risk (vs. 60% human), 4–25% on cough origin.
    • Level AGI: GPT-4o scores ∼0.15–1.29/4 vs. 1.2–1.39/4 human.
  • Side Analyses: Acoustic encoders such as Whisper produce highly similar embeddings (cosine similarity >0.85 for 5 s audio; >0.54 for 30 s) for semantically equivalent utterances that differ in emotion or gender label, implying poor paralinguistic separation. Additionally, ∼60% of ASR errors in speech LLMs stem from utterance truncation, which does not occur in Whisper, and accuracy shifts by up to ±15% under prompt paraphrase, indicating brittle instruction following (Bu et al., 2024).
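The encoder-similarity analysis can be illustrated with a small cosine-similarity sketch. It assumes that each utterance's frame-level encoder features are mean-pooled over time into a single vector before comparison; the source does not state the pooling method, so that step and the function names are assumptions. The feature values are made-up toy numbers, not Whisper outputs.

```python
import math
from typing import List, Sequence


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average frame-level features over time into one utterance vector (assumed pooling)."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Toy features for the same sentence spoken with two different emotions.
neutral_frames = [[0.9, 0.1, 0.3], [1.1, 0.2, 0.2]]
angry_frames = [[1.0, 0.3, 0.2], [1.0, 0.1, 0.4]]
similarity = cosine_similarity(mean_pool(neutral_frames), mean_pool(angry_frames))
print(f"cosine similarity: {similarity:.3f}")
```

A value close to 1 for utterances that differ only in emotion or speaker gender would mean the pooled representation carries little paralinguistic information for a downstream LLM to use.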

7. Recommendations and Future Directions

SAGI’s findings guide several future directions:

  • Data Diversity: The incorporation of balanced and additional paralinguistic (e.g., pitch, loudness) and specialist (medical, music, animal sounds) corpora is recommended to close data gaps.
  • Encoder–LLM Integration: New architectures—such as adapters and cross-attention—should be explored to prevent loss of low-level acoustic information, favoring joint fine-tuning over frozen encoders.
  • Instruction Tuning: Borrowing methodologies from advances in text-only instruction tuning could improve prompt robustness and model adherence to task specifications.
  • Strengthening LLM Backbones: Leveraging distillation from high-performing models (e.g., GPT-4o) that show competence in phoneme processing and few-shot learning is prioritized.
  • Abstract Knowledge Modules: Integration of domain-specific modules for fields such as medical acoustics and music theory may facilitate progress from L4 to L5 tasks.
  • Benchmark Expansion: Supplementing SAGI with additional L4/L5 tasks (e.g., sarcasm, music appreciation) and “in-the-wild” conversational settings is encouraged for more comprehensive generalist evaluation (Bu et al., 2024).

In sum, the SAGI Benchmark translates a rigorous five-level competence framework into a concrete set of >30 tasks, standardizing formats and metrics while exposing present limitations in speech LLMs, especially at the levels of physical acoustics, paralinguistics, and specialist abstraction. Addressing these deficiencies will require innovations spanning data, model architecture, information preservation, instruction tuning, and domain-specific integration.
