InstructS2S-200K Dataset
- InstructS2S-200K is a synthetic dataset comprising 200,000 ⟨speech-instruction, text-response, speech-response⟩ triplets for end-to-end speech model training.
- It employs robust pipelines—including data sourcing, speech-style rewriting, and high-quality TTS synthesis—to produce clean, studio-quality audio samples.
- The dataset uses HuBERT feature extraction with K-means clustering to facilitate efficient training of speech-instruct models like LLaMA-Omni.
InstructS2S-200K is a large-scale, synthetic dataset specifically designed for aligning open-source LLMs with end-to-end speech interaction workflows. The dataset targets the simultaneous comprehension and generation of spoken language by LLM architectures and is central to training models such as LLaMA-Omni, which operate without intermediate transcription steps. InstructS2S-200K consists of 200,000 paired examples, each in the form of ⟨speech-instruction, text-response, speech-response⟩, constructed using distinct pipelines for data sourcing, rewriting, synthesis, and speech feature extraction. The dataset is intended to facilitate low-latency, high-quality, and scalable speech LLM training in open-source contexts (Fang et al., 2024).
1. Dataset Design and Composition
InstructS2S-200K comprises exactly 200,000 examples, each a triplet ⟨speech-instruction, text-response, speech-response⟩. Its primary purpose is to support end-to-end training of speech-interactive LLM systems, particularly those that omit an explicit speech recognition module. The aggregate duration is not specified, but per-utterance durations can be computed from the underlying audio files.
No internal splits for development or testing are present within InstructS2S-200K; all 200,000 examples are allocated for training purposes. Evaluation is performed on a separate set, InstructS2S-Eval, which derives from Alpaca-Eval and is not bundled with the dataset (Fang et al., 2024).
The core structural statistics are summarized as follows:
| Component | Value | Notes |
|---|---|---|
| Examples | 200,000 | All for training |
| Triplet form | Yes | ⟨speech-instruction, text-response, speech-response⟩ |
| Speech data | WAV | Clean, synthesized, studio-quality |
2. Data Sourcing, Instruction Rewriting, and Annotation
The instruction set is sourced from two existing datasets: 50,000 from Alpaca, representing general-purpose tasks, and 150,000 first-turn instructions from UltraChat, focused primarily on world-knowledge questions. This distribution yields a 25%/75% split of “general-purpose” versus “fact-question” prompts.
Each text instruction undergoes “speech-style” rewriting using Llama-3-70B-Instruct. This process involves:
- Inserting natural conversational fillers such as “uh”, “um”, and “so”
- Expanding numerals to their word equivalents
- Shortening verbose queries
Llama-3-70B-Instruct also generates concise, speech-appropriate responses. Specific directives restrict output to linear utterances, disallowing lists or parentheses and requiring numbers to be expressed in words.
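The response constraints above (no lists, no parentheses, numbers spelled out) can be checked mechanically. The following is a minimal sketch of such a check; the function name `speech_style_violations` and the regular expressions are illustrative assumptions, not the prompt or filtering code used to build the dataset.

```python
import re

# Hypothetical checker: the rules mirror the constraints described in the text,
# not the authors' actual generation prompt or filtering code.
_DIGIT = re.compile(r"\d")
_LIST_MARKER = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+", re.MULTILINE)


def speech_style_violations(text: str) -> list[str]:
    """Return the names of the speech-style rules that `text` breaks."""
    violations = []
    if _DIGIT.search(text):
        violations.append("digits")       # numbers should be written as words
    if "(" in text or ")" in text:
        violations.append("parentheses")  # parenthetical asides are disallowed
    if _LIST_MARKER.search(text):
        violations.append("list")         # bulleted or numbered list lines
    return violations


print(speech_style_violations("There are three primary colors."))  # []
print(speech_style_violations("Use 2 cups (about 500 ml)."))       # ['digits', 'parentheses']
```

A check of this kind only flags surface patterns; it does not judge whether a response is concise or natural when spoken.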
Speech synthesis operates via two separate TTS models:
- CosyVoice-300M-SFT synthesizes instruction utterances, randomized between male and female voices.
- VITS (trained on LJSpeech) synthesizes responses, always using a single “standard” synthetic voice.
All data is cleanly synthesized, with no background noise or channel perturbation. Audio is typically stored as WAV at a minimum sampling rate of 22,050 Hz, although the exact rate is inherited from the respective TTS models.
There are no human transcriptions. The text of each synthetic prompt or reply is the exact input given to the TTS model, so text and speech are aligned by construction.
For speech decoder training, HuBERT features are quantized into discrete units by K-means clustering with $K$ clusters, and consecutive repeated units are collapsed for CTC-based training.
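The collapsing step can be illustrated on a toy sequence. This sketch assumes that collapsing means merging runs of identical consecutive unit IDs; the IDs are made up, and the HuBERT and K-means stages that would produce them are not shown.

```python
from itertools import groupby


def collapse_repeats(units):
    """Merge each run of identical consecutive unit IDs into a single ID."""
    return [unit for unit, _ in groupby(units)]


# Made-up frame-level cluster IDs, one per speech frame.
frame_units = [17, 17, 17, 402, 402, 9, 17, 17]
print(collapse_repeats(frame_units))  # [17, 402, 9, 17]
```

Only adjacent duplicates are merged, so a unit that recurs later in the utterance (17 in the example) is kept.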
3. Data Organization and File Structure
The dataset is organized under a canonical directory layout that includes a metadata file and a flat audio file structure. The metadata file, metadata.jsonl, encodes all necessary triplet information as one JSON object per line:
| Field | Description |
|---|---|
| instruct_id | Unique identifier |
| instruction_text | Speech-style text prompt |
| response_text | Speech-style text response |
| instruction_audio | Path to WAV file for instruction |
| response_audio | Path to WAV file for response |
| tts_voice_inst | TTS model and gender for instruction |
| tts_voice_resp | TTS model and gender for response |
| duration_inst_s | Duration (seconds) of instruction utterance |
| duration_resp_s | Duration (seconds) of response utterance |
The audio directory holds all instruction and response files as inst_{NNNNNN}.wav and resp_{NNNNNN}.wav, respectively, where NNNNNN is the example index. Users can extract utterance durations from audio file headers as needed.
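A minimal sketch of reading this layout follows. It assumes that NNNNNN is a six-digit, zero-padded integer index and that metadata.jsonl is UTF-8 encoded; the helper names `audio_names` and `read_metadata` are hypothetical.

```python
import json
from pathlib import Path


def audio_names(index: int) -> tuple[str, str]:
    """Instruction and response file names for one example.

    Assumes a six-digit zero-padded index, matching the NNNNNN pattern.
    """
    return f"inst_{index:06d}.wav", f"resp_{index:06d}.wav"


def read_metadata(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


print(audio_names(42))  # ('inst_000042.wav', 'resp_000042.wav')
```

Reading the file lazily, one line at a time, avoids holding all 200,000 records in memory when only a single pass is needed.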
4. Content Analysis and Speech Characteristics
Instructions are single-turn and open-ended, and they do not include dialogue-style turns, code, or mathematics. Prompts are predominantly open-domain or world-knowledge questions, with the following approximate characteristics:
- Instruction utterance length: Typically 5–12 spoken tokens (including fillers)
- Response utterance length: Typically 6–15 spoken tokens
- Speaker and voice diversity: All speakers are synthetic; instructions randomized between two CosyVoice voices (male/female), responses fixed to a single VITS voice.
There is no accent, regional, or environmental diversity beyond what the TTS models natively provide. This exclusive reliance on synthetic voices results in highly uniform, noise-free, studio-quality recordings.
Average audio durations per utterance are not reported, so any timing statistics must be computed post hoc by dataset users, for example by parsing the WAV headers.
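As a sketch of such a post-hoc computation, the duration of one file can be read from its header with Python's standard `wave` module. This assumes uncompressed PCM WAV files, which is what that module reads; the function name `wav_duration_seconds` is illustrative.

```python
import wave


def wav_duration_seconds(path) -> float:
    """Duration of a PCM WAV file, computed from header fields only."""
    with wave.open(str(path), "rb") as wav:
        # Frame count divided by frames per second gives seconds.
        return wav.getnframes() / wav.getframerate()
```

Because only the header is consulted, this stays cheap even when applied to every instruction and response file.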
5. Speech Feature Extraction and Preprocessing
For alignment with the LLaMA-Omni architecture, recommended preprocessing and feature workflows are as follows:
- Feature extraction: HuBERT is used to compute speech features, which are quantized into discrete units via K-means with $K$ clusters. Repeated units are collapsed via CTC for efficient decoder training.
- Frame downsampling: Input speech features are downsampled by a factor of 5 in the speech adaptor module.
- No noise augmentation: Clean synthetic outputs are delivered as-is. The native release includes no augmentation, but end-users may add noise or speed perturbation for robustness.
A plausible implication is that models trained solely on this dataset may show limited robustness to real-world variability absent further augmentation or fine-tuning.
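The frame downsampling step listed above can be sketched as follows. The sketch assumes that downsampling by a factor of 5 means concatenating every 5 consecutive frames along the feature dimension and dropping any incomplete trailing group; this section does not specify the mechanism, so treat it as one plausible reading. The function name `stack_frames` is hypothetical.

```python
def stack_frames(frames, k=5):
    """Concatenate every k consecutive frames along the feature dimension.

    `frames` is a list of equal-length feature vectors (lists of numbers).
    Trailing frames that do not fill a complete group of k are dropped.
    """
    usable = len(frames) - len(frames) % k
    return [
        [value for frame in frames[start:start + k] for value in frame]
        for start in range(0, usable, k)
    ]


# Twelve toy frames with two features each become two frames with ten features.
toy = [[i, i + 0.5] for i in range(12)]
stacked = stack_frames(toy)
print(len(stacked), len(stacked[0]))  # 2 10
```

Under this reading the sequence becomes 5 times shorter while each frame becomes 5 times wider, so no information inside complete groups is discarded.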
6. Licensing and Usage Considerations
The dataset is constructed for research on end-to-end speech-instruct models such as LLaMA-Omni, enabling joint speech understanding and generation tasks. Licensing status for the assembled dataset is not explicitly determined; users are advised to independently verify the terms for all components, including Alpaca (data released under CC BY-NC 4.0), UltraChat (MIT), CosyVoice, VITS, and the HuBERT quantizer.
Practical dataset usage comprises loading metadata, mapping triplets to training input for LLMs capable of raw speech understanding/generation, and extracting utterance timing or duration as needed through standard audio parsers. Training on the full dataset is feasible with moderate computational resources; LLaMA-Omni was trained in under 3 days using 4 GPUs (Fang et al., 2024).
7. Statistical Metrics and Reference Formulas
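Aggregate dataset duration $T$, mean utterance duration $\bar{d}$, and duration standard deviation $\sigma_d$ can be calculated according to:

$$T = \sum_{i=1}^{N} d_i, \qquad \bar{d} = \frac{T}{N}, \qquad \sigma_d = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(d_i - \bar{d}\right)^2}$$

where $N = 200{,}000$ and $d_i$ is the total duration of the $i$-th instruction+response pair in seconds.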
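A minimal sketch of these statistics follows. It assumes each record carries the `duration_inst_s` and `duration_resp_s` fields from the metadata table and uses the population form of the standard deviation (dividing by the number of pairs); the function name `duration_stats` is illustrative.

```python
from statistics import fmean, pstdev


def duration_stats(records):
    """Return (total, mean, population std) of per-pair durations in seconds."""
    # Each pair duration is the instruction duration plus the response duration.
    pair_durations = [r["duration_inst_s"] + r["duration_resp_s"] for r in records]
    return sum(pair_durations), fmean(pair_durations), pstdev(pair_durations)


# Two made-up records, for illustration only.
example = [
    {"duration_inst_s": 1.0, "duration_resp_s": 2.0},
    {"duration_inst_s": 2.0, "duration_resp_s": 3.0},
]
print(duration_stats(example))  # (8.0, 4.0, 1.0)
```

If a sample estimate is preferred, `statistics.stdev` divides by one less than the number of pairs instead.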
Dataset users typically extract and process example durations directly from the file metadata and WAV headers prior to feeding the data into their speech-LLM pipelines. Further statistical analysis of utterance or instruction diversity is possible, conditional on parsing the existing metadata structures.