Dynamic-SUPERB Phase-1 Benchmark
- Dynamic-SUPERB Phase-1 is the first speech instruction-tuning benchmark that evaluates universal speech models on 33 classification tasks from 22 datasets.
- It employs a rigorous methodology featuring diverse task dimensions, detailed evaluation metrics, and comparisons across multiple baseline models.
- The benchmark's extensible, community-driven framework fosters continuous contributions to enhance zero-shot generalization in speech tasks.
Dynamic-SUPERB Phase-1 is the first instruction-tuning benchmark in speech designed to evaluate and enable the development of universal speech models across a broad array of classification tasks. Addressing the scarcity of standardized, extensible benchmarks for zero-shot generalization in speech, Dynamic-SUPERB Phase-1 combines a diverse set of tasks and datasets with a rigorous protocol, detailed evaluation metrics, multiple baselines, and a mechanism for ongoing community-driven expansion (Huang et al., 2023).
1. Design Objectives and Benchmark Scope
Dynamic-SUPERB Phase-1 aims to establish a benchmark that (a) covers an extensive spectrum of classification-style speech tasks, (b) allows for measurable evaluation of zero-shot generalization, and (c) remains extensible through community contributions. Phase-1 comprises 33 distinct tasks sourced from 22 public English datasets, generating 55 “instances” (each defined as a unique task–dataset pair).
Tasks are grouped into six broad “dimensions”:
- Content (CNT): Spoken-term detection, keyword spotting, content-based classification.
- Speaker (SPK): Speaker identification, speaker verification.
- Semantics (SEM): Intent classification, emotion recognition.
- Degradation (DEG): Noise-type classification, codec-type detection.
- Paralinguistics (PRL): Speaker age/gender classification, emotion, speaking style.
- Audio (AUD): Non-speech audio tagging (e.g., environmental or event classification).
Each instance is categorized as either “seen” (included in model training) or “unseen” (new task–dataset pairs), with 24 seen and 31 unseen instances in the Phase-1 held-out evaluation set. The Dynamic-SUPERB-Train collection uses the same task definitions but alternate datasets, ensuring that evaluation on unseen instances measures true generalization beyond training exposure.
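A minimal sketch of this partition, assuming (per the definition above) that an instance counts as "seen" only when its exact task–dataset pair also appears in the training collection; the instance lists below are illustrative placeholders, not the official benchmark manifests:

```python
def partition_instances(eval_instances, train_instances):
    """Label each evaluation instance as 'seen' or 'unseen'.

    An instance is a (task, dataset) pair; it is 'seen' only if the exact
    pair also occurs in the training collection (illustrative definition).
    """
    train_pairs = set(train_instances)
    return {
        pair: ("seen" if pair in train_pairs else "unseen")
        for pair in eval_instances
    }

# Hypothetical example: SpeakerVerification appears in training only with VCTK,
# so pairing it with VoxCeleb yields an unseen instance.
split = partition_instances(
    eval_instances=[("SpeakerVerification", "VCTK"),
                    ("SpeakerVerification", "VoxCeleb")],
    train_instances=[("SpeakerVerification", "VCTK")],
)
print(split)
# {('SpeakerVerification', 'VCTK'): 'seen',
#  ('SpeakerVerification', 'VoxCeleb'): 'unseen'}
```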
2. Task and Dataset Composition
The benchmark’s 55 instances span 33 tasks and 22 datasets, each representing a distinct classification challenge. All tasks have a fixed label set; dataset documentation includes standard splits, domain descriptions, speaker counts, utterance durations, and SNR or noise level metadata where relevant.
Task–Dataset Examples
| Dimension | Sample Datasets and Associated Tasks |
|---|---|
| Content (CNT) | LibriSpeech (spoken-term detection), LJSpeech (yes/no question classification) |
| Speaker (SPK) | VCTK (speaker identification/verification), VoxCeleb (speaker verification) |
| Semantics (SEM) | MELD (emotion recognition), SNIPS (intent classification), SLUE (semantic slot classification) |
| Degradation (DEG) | Aurora (noise-type classification), MS-SNSD (SNR detection) |
| Paralinguistics (PRL) | CREMA-D (emotion/style), URTIC (gender/age) |
| Audio (AUD) | ESC-50 (environmental classification), FSD50K (sound-event tagging) |
All tasks are structured as classification with explicit, finite label sets (e.g., {happy, sad, angry}). The full list of tasks and datasets is maintained in the public repository.
3. Instruction-Tuning Instance Construction
Each evaluation instance consists of three components:
- Instruction (text): A clear prompt specifying the task.
- Utterance(s) (audio): Raw audio input(s) for classification.
- Ground-truth Label (text): The correct label, denoted as a single letter (e.g., “A,” “B”).
Instruction prompts are manually authored and then paraphrased via ChatGPT to generate 10–30 distinct phrasings per task. Each prompt concludes with a list of label choices (“The answer could be A, B, or C”) enforcing single-label selection.
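A small sketch of how a final prompt can be assembled from a paraphrased template and the appended option list; the wording and label set are illustrative, not the benchmark's actual templates:

```python
import random

# Illustrative paraphrases and label set; the benchmark's real templates
# are maintained in the public repository.
PARAPHRASES = [
    "Identify the emotion expressed in the recording.",
    "Which emotion does the speaker convey?",
]
LABELS = {"A": "happy", "B": "sad", "C": "angry"}

def build_instruction(paraphrases=PARAPHRASES, labels=LABELS):
    """Pick one paraphrased prompt and append the explicit option list,
    mirroring the single-letter answer format described above."""
    prompt = random.choice(paraphrases)
    options = ", ".join(f"{letter} ({name})" for letter, name in labels.items())
    letters = ", ".join(list(labels)[:-1]) + f", or {list(labels)[-1]}"
    return f"{prompt} Options: {options}. The answer could be {letters}."

print(build_instruction())
# e.g. "Which emotion does the speaker convey? Options: A (happy), B (sad),
#       C (angry). The answer could be A, B, or C."
```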
For zero-shot evaluation, models receive the concatenated instruction and audio. Output must be exactly the correct label token; deviations (e.g., synonyms, paraphrases) are not accepted. For ASR-to-LLM baseline pipelines, the system concatenates the instruction with the transcript and includes an explicit prompt for single-token selection.
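Scoring then reduces to exact string matching against the ground-truth letters; the helper below is a sketch for illustration, not the official evaluation script:

```python
def exact_match_accuracy(predictions, references):
    """Exact-match accuracy over label tokens: a prediction counts only if,
    after trimming whitespace, it equals the ground-truth letter exactly.
    Synonyms or paraphrases (e.g. 'sad' instead of 'B') score zero,
    per the protocol described above."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

# Hypothetical model outputs for three instances with ground truth A, B, C.
print(exact_match_accuracy(["A", "sad", "C"], ["A", "B", "C"]))  # 0.666...
```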
4. Baseline Models and Adaptations
Dynamic-SUPERB Phase-1 assesses five diverse baseline systems, selected to cover the main architectural approaches: speech-only models, multimodal LLMs with audio adapters, and a cascaded ASR-plus-LLM pipeline.
| Baseline | Architecture & Key Adaptation Steps | Parameter Count |
|---|---|---|
| BERT-GSLM | HuBERT discrete audio tokens + BERT-text fusion into uLM; frozen feature extractors; train uLM + projection | ~210 M |
| Whisper (medium) | Encoder–decoder ASR model; appended instruction tokens; fine-tuned encoder and decoder | ~769 M |
| ImageBind-LLM | LLaMA 7B + frozen ImageBind; audio embedding via adapters into LLaMA; adapters only trained | 7 B + ~85 M |
| Whisper-LLM | LLaMA 7B + frozen Whisper encoder for temporal audio features; adapters only trained | 7 B + ~85 M |
| ASR-ChatGPT | Cascade: pre-trained Whisper ASR → instruction + prompt → OpenAI ChatGPT (API, no tuning) | ~175 B (ChatGPT; size not officially disclosed) |
BERT-GSLM quantizes HuBERT features with k-means into discrete audio tokens and pairs them with a frozen BERT-base for instruction embeddings, fused through a projection layer into a unit language model (uLM). Whisper is fine-tuned end-to-end with the instruction appended as context. ImageBind-LLM and Whisper-LLM build on a large text LLM (LLaMA 7B) that receives audio features through injected adapters; only the adapters are trained. ASR-ChatGPT operates as a zero-shot cascade with no fine-tuning.
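The cascade baseline is the simplest to reproduce in outline. The sketch below assumes the openai-whisper package and the OpenAI Python client; model names, prompt wording, and the single-letter constraint are illustrative choices, not the authors' exact configuration:

```python
# Minimal sketch of the ASR-ChatGPT cascade (assumptions noted above).
import whisper
from openai import OpenAI

asr = whisper.load_model("medium")   # pre-trained Whisper ASR, no fine-tuning
client = OpenAI()                    # reads OPENAI_API_KEY from the environment

def asr_chatgpt_baseline(instruction: str, audio_path: str) -> str:
    """Transcribe the utterance, then ask the LLM to pick a single label."""
    transcript = asr.transcribe(audio_path)["text"]
    prompt = (
        f"{instruction}\n"
        f"Transcript: {transcript}\n"
        "Answer with a single option letter only."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```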
5. Evaluation Protocol and Metrics
The primary evaluation criterion is classification accuracy:

$$\text{Accuracy} = \frac{\text{number of correctly predicted labels}}{\text{total number of evaluated instances}}$$

For generative tasks (to be introduced in future phases), additional metrics are defined:

- Word Error Rate (WER): $\text{WER} = \frac{S + D + I}{N}$, where $S$, $D$, and $I$ are word-level substitutions, deletions, and insertions, and $N$ is the reference word count.
- Character Error Rate (CER): computed analogously at the character level, with substitutions, deletions, and insertions counted over characters and $N$ the total number of reference characters.
- Macro F1: Used for multi-class and class-imbalanced settings.
Phase-1 focuses on single-label classification, with the groundwork in place for later extension to generative and structured-output tasks.
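For reference, the metrics above can be implemented directly from their definitions; this is an illustrative sketch, not the benchmark's official scoring code:

```python
def levenshtein(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions
    needed to turn the hypothesis sequence into the reference."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def accuracy(refs, hyps):
    """Fraction of exact label matches (single-label classification)."""
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)

def wer(ref_text, hyp_text):
    """Word Error Rate: (S + D + I) / N over word sequences."""
    ref = ref_text.split()
    return levenshtein(ref, hyp_text.split()) / max(len(ref), 1)

def cer(ref_text, hyp_text):
    """Character Error Rate: same edit distance, over characters."""
    return levenshtein(list(ref_text), list(hyp_text)) / max(len(ref_text), 1)

def macro_f1(refs, hyps):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in sorted(set(refs) | set(hyps)):
        tp = sum(r == label and h == label for r, h in zip(refs, hyps))
        fp = sum(r != label and h == label for r, h in zip(refs, hyps))
        fn = sum(r == label and h != label for r, h in zip(refs, hyps))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
    return sum(scores) / len(scores)
```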
6. Empirical Results and Analysis
Evaluation reveals a marked gap between seen and unseen instance performance across baselines.
Seen Task Performance (average accuracy by dimension)
| Model | CNT | SPK | SEM | DEG | PRL |
|---|---|---|---|---|---|
| BERT-GSLM | 66.3 | 49.1 | 47.2 | 68.2 | 52.7 |
| Whisper | 95.3 | 47.9 | 55.5 | 71.1 | 49.4 |
| ImageBind-LLM | 64.3 | 54.7 | 47.6 | 78.7 | 59.8 |
| Whisper-LLM | 77.6 | 91.7 | 55.7 | 91.0 | 66.3 |
| Random | 49.9 | 40.2 | 41.0 | 45.9 | 67.1 |
Key findings:
- Whisper excels at content tasks; Whisper-LLM outperforms in speaker and degradation tasks.
- ImageBind-LLM and BERT-GSLM beat the random baseline on most dimensions but trail Whisper-LLM overall.
Unseen Task Performance (average accuracy by dimension)
| Model | CNT | SPK | SEM | DEG | PRL | AUD |
|---|---|---|---|---|---|---|
| BERT-GSLM | 0.0 | 32.8 | 5.3 | 41.6 | 12.6 | 0.0 |
| Whisper | 14.4 | 58.0 | 13.8 | 55.4 | 8.5 | 0.8 |
| ImageBind-LLM | 15.7 | 45.4 | 24.7 | 47.6 | 20.6 | 35.7 |
| Whisper-LLM | 8.7 | 60.6 | 20.9 | 59.0 | 6.6 | 15.9 |
| ASR-ChatGPT | 65.0 | 40.1 | 69.3 | 43.5 | 22.9 | 9.8 |
| Random | 11.8 | 50.2 | 33.1 | 43.1 | 21.0 | 23.4 |
All models demonstrate substantial accuracy drops on unseen tasks. ASR-ChatGPT performs particularly well on semantic tasks but fails on speaker and paralinguistics tasks, while speech-only models sometimes underperform random baselines on novel instances. Multimodal LLMs maintain performance close to random, leveraging text pretraining to parse instructions, but do not demonstrate genuine zero-shot generalization.
7. Framework for Dynamic, Community-Driven Expansion
Dynamic-SUPERB’s extensibility is enabled through a structured pipeline for contributions:
- Task Proposal: Contributors define the task, labels, and dataset.
- Instruction Templates: 10–30 variants, generated via manual writing and/or LLM paraphrasing.
- Submission: Pull request to the public GitHub repository.
- Review: Evaluation focuses on technical accuracy, instruction clarity, and novelty.
- No Model Retraining: Any new task conforming to the three-component format can be instantly evaluated zero-shot.
- Community Infrastructure: An open leaderboard and discussion forum help track baselines and coordinate ongoing efforts.
All benchmark materials—including data, code, training scripts, and documentation—are provided under open access at https://github.com/dynamic-superb/dynamic-superb.
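As an illustration of what a contribution must supply, a new task can be packaged roughly as follows; the field names and file layout are hypothetical and do not reproduce the repository's actual schema:

```python
# Hypothetical task definition for a community contribution; the real
# repository defines its own file layout and metadata schema.
new_task = {
    "name": "NoiseTypeClassification",
    "dimension": "DEG",
    "labels": {"A": "babble", "B": "white", "C": "street"},
    "instructions": [
        # 10-30 manually written and/or LLM-paraphrased variants
        "What kind of background noise is present? The answer could be A, B, or C.",
        "Identify the noise type in this clip. The answer could be A, B, or C.",
    ],
    "instances": [
        # one entry per (audio, label) pair from the proposed dataset
        {"audio": "clips/0001.wav", "label": "A"},
        {"audio": "clips/0002.wav", "label": "C"},
    ],
}
```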
8. Strengths, Limitations, and Future Directions
Dynamic-SUPERB Phase-1 establishes a rigorous multi-dimensional testbed for speech instruction tuning, with several strengths:
- Demonstrates that combining temporal speech features with a large-scale language model (Whisper-LLM) delivers leading performance on seen tasks.
- Shows that a large pre-trained text LLM (ASR-ChatGPT) can generalize to semantic task formats without additional tuning.
- Ensures extensibility at the benchmark and methodology level.
However, significant limitations persist:
- Zero-shot transfer remains weak: models exploit superficial instruction regularities (“bag-of-words” strategies) rather than semantically grounded task comprehension.
- Speech-only or weakly multimodal models (BERT-GSLM, Whisper) lack the knowledge to parse entirely new instructions.
- Current coverage excludes generative tasks such as speech-to-speech translation or summarization, slated for future expansion.
Continual evolution through community contributions and method development is integral to the Dynamic-SUPERB roadmap, with improved zero-shot semantic understanding and expanded task modalities as explicit goals (Huang et al., 2023).