Dynamic-SUPERB Phase-1 Benchmark
- Dynamic-SUPERB Phase-1 is the first speech instruction-tuning benchmark that evaluates universal speech models on 33 classification tasks from 22 datasets.
- It employs a rigorous methodology featuring diverse task dimensions, detailed evaluation metrics, and comparisons across multiple baseline models.
- The benchmark's extensible, community-driven framework fosters continuous contributions to enhance zero-shot generalization in speech tasks.
Dynamic-SUPERB Phase-1 is the first instruction-tuning benchmark in speech designed to evaluate and enable the development of universal speech models across a broad array of classification tasks. Addressing the scarcity of standardized, extensible benchmarks for zero-shot generalization in speech, Dynamic-SUPERB Phase-1 combines a diverse set of tasks and datasets with a rigorous protocol, detailed evaluation metrics, multiple baselines, and a mechanism for ongoing community-driven expansion (Huang et al., 2023).
1. Design Objectives and Benchmark Scope
Dynamic-SUPERB Phase-1 aims to establish a benchmark that (a) covers an extensive spectrum of classification-style speech tasks, (b) allows for measurable evaluation of zero-shot generalization, and (c) remains extensible through community contributions. Phase-1 comprises 33 distinct tasks sourced from 22 public English datasets, generating 55 “instances” (each defined as a unique task–dataset pair).
Tasks are grouped into six broad “dimensions”:
- Content (CNT): Spoken-term detection, keyword spotting, content-based classification.
- Speaker (SPK): Speaker identification, speaker verification.
- Semantics (SEM): Intent classification, emotion recognition.
- Degradation (DEG): Noise-type classification, codec-type detection.
- Paralinguistics (PRL): Speaker age/gender classification, emotion, speaking style.
- Audio (AUD): Non-speech audio tagging (e.g., environmental or event classification).
Each instance is categorized as either “seen” (included in model training) or “unseen” (new task–dataset pairs), with 24 seen and 31 unseen instances in the Phase-1 held-out evaluation set. The Dynamic-SUPERB-Train collection uses the same task definitions but alternate datasets, ensuring that evaluation on unseen instances measures true generalization beyond training exposure.
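A minimal sketch of this partition, assuming (per the definition above) that an instance counts as "seen" only when its exact task–dataset pair also appears in the training collection; the instance lists below are illustrative placeholders, not the official benchmark manifests:

```python
def partition_instances(eval_instances, train_instances):
    """Label each evaluation instance as 'seen' or 'unseen'.

    An instance is a (task, dataset) pair; it is 'seen' only if the exact
    pair also occurs in the training collection (illustrative definition).
    """
    train_pairs = set(train_instances)
    return {
        pair: ("seen" if pair in train_pairs else "unseen")
        for pair in eval_instances
    }

# Hypothetical example: SpeakerVerification appears in training only with VCTK,
# so pairing it with VoxCeleb yields an unseen instance.
split = partition_instances(
    eval_instances=[("SpeakerVerification", "VCTK"),
                    ("SpeakerVerification", "VoxCeleb")],
    train_instances=[("SpeakerVerification", "VCTK")],
)
print(split)
# {('SpeakerVerification', 'VCTK'): 'seen',
#  ('SpeakerVerification', 'VoxCeleb'): 'unseen'}
```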
2. Task and Dataset Composition
The benchmark’s 55 instances span 33 tasks and 22 datasets, each representing a distinct classification challenge. All tasks have a fixed label set; dataset documentation includes standard splits, domain descriptions, speaker counts, utterance durations, and SNR or noise level metadata where relevant.
Task–Dataset Examples
| Dimension | Sample Datasets and Associated Tasks |
|---|---|
| Content (CNT) | LibriSpeech (spoken-term detection), LJSpeech (yes/no question classification) |
| Speaker (SPK) | VCTK (speaker identification/verification), VoxCeleb (speaker verification) |
| Semantics (SEM) | MELD (emotion recognition), SNIPS (intent classification), SLUE (semantic slot classification) |
| Degradation (DEG) | Aurora (noise-type classification), MS-SNSD (SNR detection) |
| Paralinguistics (PRL) | CREMA-D (emotion/style), URTIC (gender/age) |
| Audio (AUD) | ESC-50 (environmental classification), FSD50K (sound-event tagging) |
All tasks are structured as classification with explicit, finite label sets (e.g., {happy, sad, angry}). The full list of tasks and datasets is maintained in the public repository.
3. Instruction-Tuning Instance Construction
Each evaluation instance consists of three components:
- Instruction (text): A clear prompt specifying the task.
- Utterance(s) (audio): Raw audio input(s) for classification.
- Ground-truth Label (text): The correct label, denoted as a single letter (e.g., “A,” “B”).
Instruction prompts are manually authored and then paraphrased via ChatGPT to generate 10–30 distinct phrasings per task. Each prompt concludes with a list of label choices (“The answer could be A, B, or C”) enforcing single-label selection.
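A small sketch of how a final prompt can be assembled from a paraphrased template and the appended option list; the wording and label set are illustrative, not the benchmark's actual templates:

```python
import random

# Illustrative paraphrases and label set; the benchmark's real templates
# are maintained in the public repository.
PARAPHRASES = [
    "Identify the emotion expressed in the recording.",
    "Which emotion does the speaker convey?",
]
LABELS = {"A": "happy", "B": "sad", "C": "angry"}

def build_instruction(paraphrases=PARAPHRASES, labels=LABELS):
    """Pick one paraphrased prompt and append the explicit option list,
    mirroring the single-letter answer format described above."""
    prompt = random.choice(paraphrases)
    options = ", ".join(f"{letter} ({name})" for letter, name in labels.items())
    letters = ", ".join(list(labels)[:-1]) + f", or {list(labels)[-1]}"
    return f"{prompt} Options: {options}. The answer could be {letters}."

print(build_instruction())
# e.g. "Which emotion does the speaker convey? Options: A (happy), B (sad),
#       C (angry). The answer could be A, B, or C."
```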
For zero-shot evaluation, models receive the concatenated instruction and audio. Output must be exactly the correct label token; deviations (e.g., synonyms, paraphrases) are not accepted. For ASR-to-LLM baseline pipelines, the system concatenates the instruction with the transcript and includes an explicit prompt for single-token selection.
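Scoring then reduces to exact string matching against the ground-truth letters; the helper below is a sketch for illustration, not the official evaluation script:

```python
def exact_match_accuracy(predictions, references):
    """Exact-match accuracy over label tokens: a prediction counts only if,
    after trimming whitespace, it equals the ground-truth letter exactly.
    Synonyms or paraphrases (e.g. 'sad' instead of 'B') score zero,
    per the protocol described above."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

# Hypothetical model outputs for three instances with ground truth A, B, C.
print(exact_match_accuracy(["A", "sad", "C"], ["A", "B", "C"]))  # 0.666...
```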
4. Baseline Models and Adaptations
Dynamic-SUPERB Phase-1 assesses five diverse baseline systems, selected to cover the main architectural approaches: speech-only models, multimodal LLMs with audio adapters, and a cascaded ASR-plus-LLM pipeline.
| Baseline | Architecture & Key Adaptation Steps | Parameter Count |
|---|---|---|
| BERT-GSLM | HuBERT discrete audio tokens + BERT-text fusion into uLM; frozen feature extractors; train uLM + projection | ~210 M |
| Whisper (medium) | Encoder–decoder ASR model; appended instruction tokens; fine-tuned encoder and decoder | ~769 M |
| ImageBind-LLM | LLaMA 7B + frozen ImageBind; audio embedding via adapters into LLaMA; adapters only trained | 7 B + ~85 M |
| Whisper-LLM | LLaMA 7B + frozen Whisper encoder for temporal audio features; adapters only trained | 7 B + ~85 M |
| ASR-ChatGPT | Cascade: pre-trained Whisper ASR → instruction + prompt → OpenAI ChatGPT (API, no tuning) | ~175 B (ChatGPT; size not officially disclosed) |
BERT-GSLM quantizes HuBERT features with k-means into discrete audio tokens and pairs them with a frozen BERT-base for instruction embeddings, fused through a projection layer into a unit language model (uLM). Whisper is fine-tuned end-to-end with the instruction appended as context. ImageBind-LLM and Whisper-LLM build on a large text LLM (LLaMA 7B) that receives audio features through injected adapters; only the adapters are trained. ASR-ChatGPT operates as a zero-shot cascade with no fine-tuning.
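The cascade baseline is the simplest to reproduce in outline. The sketch below assumes the openai-whisper package and the OpenAI Python client; model names, prompt wording, and the single-letter constraint are illustrative choices, not the authors' exact configuration:

```python
# Minimal sketch of the ASR-ChatGPT cascade (assumptions noted above).
import whisper
from openai import OpenAI

asr = whisper.load_model("medium")   # pre-trained Whisper ASR, no fine-tuning
client = OpenAI()                    # reads OPENAI_API_KEY from the environment

def asr_chatgpt_baseline(instruction: str, audio_path: str) -> str:
    """Transcribe the utterance, then ask the LLM to pick a single label."""
    transcript = asr.transcribe(audio_path)["text"]
    prompt = (
        f"{instruction}\n"
        f"Transcript: {transcript}\n"
        "Answer with a single option letter only."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```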
5. Evaluation Protocol and Metrics
The primary evaluation criterion is classification accuracy:

$$\text{Accuracy} = \frac{\text{number of correctly predicted labels}}{\text{total number of evaluated instances}}$$

For generative tasks (to be introduced in future phases), additional metrics are defined:

- Word Error Rate (WER): $\text{WER} = \frac{S + D + I}{N}$, where $S$, $D$, and $I$ are word-level substitutions, deletions, and insertions, and $N$ is the reference word count.
- Character Error Rate (CER): computed analogously at the character level, with substitutions, deletions, and insertions counted over characters and $N$ the total number of reference characters.
- Macro F1: Used for multi-class and class-imbalanced settings.
Phase-1 focuses on single-label classification, with the groundwork in place for later extension to generative and structured-output tasks.
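For reference, the metrics above can be implemented directly from their definitions; this is an illustrative sketch, not the benchmark's official scoring code:

```python
def levenshtein(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions
    needed to turn the hypothesis sequence into the reference."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def accuracy(refs, hyps):
    """Fraction of exact label matches (single-label classification)."""
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)

def wer(ref_text, hyp_text):
    """Word Error Rate: (S + D + I) / N over word sequences."""
    ref = ref_text.split()
    return levenshtein(ref, hyp_text.split()) / max(len(ref), 1)

def cer(ref_text, hyp_text):
    """Character Error Rate: same edit distance, over characters."""
    return levenshtein(list(ref_text), list(hyp_text)) / max(len(ref_text), 1)

def macro_f1(refs, hyps):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in sorted(set(refs) | set(hyps)):
        tp = sum(r == label and h == label for r, h in zip(refs, hyps))
        fp = sum(r != label and h == label for r, h in zip(refs, hyps))
        fn = sum(r == label and h != label for r, h in zip(refs, hyps))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
    return sum(scores) / len(scores)
```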
6. Empirical Results and Analysis
Evaluation reveals a marked gap between seen and unseen instance performance across baselines.
Seen Task Performance (average accuracy by dimension)
| Model | CNT | SPK | SEM | DEG | PRL |
|---|---|---|---|---|---|
| BERT-GSLM | 66.3 | 49.1 | 47.2 | 68.2 | 52.7 |
| Whisper | 95.3 | 47.9 | 55.5 | 71.1 | 49.4 |
| ImageBind-LLM | 64.3 | 54.7 | 47.6 | 78.7 | 59.8 |
| Whisper-LLM | 77.6 | 91.7 | 55.7 | 91.0 | 66.3 |
| Random | 49.9 | 40.2 | 41.0 | 45.9 | 67.1 |
Key findings:
- Whisper excels at content tasks; Whisper-LLM outperforms in speaker and degradation tasks.
- ImageBind-LLM and BERT-GSLM beat the random baseline on most dimensions but trail Whisper-LLM overall.
Unseen Task Performance (average accuracy by dimension)
| Model | CNT | SPK | SEM | DEG | PRL | AUD |
|---|---|---|---|---|---|---|
| BERT-GSLM | 0.0 | 32.8 | 5.3 | 41.6 | 12.6 | 0.0 |
| Whisper | 14.4 | 58.0 | 13.8 | 55.4 | 8.5 | 0.8 |
| ImageBind-LLM | 15.7 | 45.4 | 24.7 | 47.6 | 20.6 | 35.7 |
| Whisper-LLM | 8.7 | 60.6 | 20.9 | 59.0 | 6.6 | 15.9 |
| ASR-ChatGPT | 65.0 | 40.1 | 69.3 | 43.5 | 22.9 | 9.8 |
| Random | 11.8 | 50.2 | 33.1 | 43.1 | 21.0 | 23.4 |
All models demonstrate substantial accuracy drops on unseen tasks. ASR-ChatGPT performs particularly well on semantic tasks but fails on speaker and paralinguistics tasks, while speech-only models sometimes underperform random baselines on novel instances. Multimodal LLMs maintain performance close to random, leveraging text pretraining to parse instructions, but do not demonstrate genuine zero-shot generalization.
7. Framework for Dynamic, Community-Driven Expansion
Dynamic-SUPERB’s extensibility is enabled through a structured pipeline for contributions:
- Task Proposal: Contributors define the task, labels, and dataset.
- Instruction Templates: 10–30 variants, generated via manual writing and/or LLM paraphrasing.
- Submission: Pull request to the public GitHub repository.
- Review: Evaluation focuses on technical accuracy, instruction clarity, and novelty.
- No Model Retraining: Any new task conforming to the three-component format can be instantly evaluated zero-shot.
- Community Infrastructure: An open leaderboard and discussion forum help track baselines and coordinate ongoing efforts.
All benchmark materials—including data, code, training scripts, and documentation—are provided under open access at https://github.com/dynamic-superb/dynamic-superb.
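As an illustration of what a contribution must supply, a new task can be packaged roughly as follows; the field names and file layout are hypothetical and do not reproduce the repository's actual schema:

```python
# Hypothetical task definition for a community contribution; the real
# repository defines its own file layout and metadata schema.
new_task = {
    "name": "NoiseTypeClassification",
    "dimension": "DEG",
    "labels": {"A": "babble", "B": "white", "C": "street"},
    "instructions": [
        # 10-30 manually written and/or LLM-paraphrased variants
        "What kind of background noise is present? The answer could be A, B, or C.",
        "Identify the noise type in this clip. The answer could be A, B, or C.",
    ],
    "instances": [
        # one entry per (audio, label) pair from the proposed dataset
        {"audio": "clips/0001.wav", "label": "A"},
        {"audio": "clips/0002.wav", "label": "C"},
    ],
}
```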
8. Strengths, Limitations, and Future Directions
Dynamic-SUPERB Phase-1 establishes a rigorous multi-dimensional testbed for speech instruction tuning, with several strengths:
- Demonstrates that combining temporal speech features with a large-scale language model (Whisper-LLM) delivers leading performance on seen tasks.
- Shows that a large pre-trained text LLM (ASR-ChatGPT) can generalize to semantic task formats without additional tuning.
- Ensures extensibility at the benchmark and methodology level.
However, significant limitations persist:
- Zero-shot transfer remains weak: models exploit superficial instruction regularities (“bag-of-words” strategies) rather than semantically grounded task comprehension.
- Speech-only or weakly multimodal models (BERT-GSLM, Whisper) lack the knowledge to parse entirely new instructions.
- Current coverage excludes generative tasks such as speech-to-speech translation or summarization, slated for future expansion.
Continual evolution through community contributions and method development is integral to the Dynamic-SUPERB roadmap, with improved zero-shot semantic understanding and expanded task modalities as explicit goals (Huang et al., 2023).