
InstructS2S-200K Dataset

Updated 13 May 2026
  • InstructS2S-200K is a synthetic dataset comprising 200,000 ⟨speech-instruction, text-response, speech-response⟩ triplets for end-to-end speech model training.
  • It employs robust pipelines—including data sourcing, speech-style rewriting, and high-quality TTS synthesis—to produce clean, studio-quality audio samples.
  • The dataset uses HuBERT feature extraction with K-means clustering to facilitate efficient training of speech-instruct models like LLaMA-Omni.

InstructS2S-200K is a large-scale, synthetic dataset specifically designed for aligning open-source LLMs with end-to-end speech interaction workflows. The dataset targets the simultaneous comprehension and generation of spoken language by LLM architectures and is central to training models such as LLaMA-Omni, which operate without intermediate transcription steps. InstructS2S-200K consists of 200,000 paired examples, each in the form of ⟨speech-instruction, text-response, speech-response⟩, constructed using distinct pipelines for data sourcing, rewriting, synthesis, and speech feature extraction. The dataset is intended to facilitate low-latency, high-quality, and scalable speech LLM training in open-source contexts (Fang et al., 2024).

1. Dataset Design and Composition

InstructS2S-200K comprises exactly 200,000 examples, each as a triplet ⟨speech-instruction, text-response, speech-response⟩. The primary purpose is to provide robust coverage for end-to-end training of speech-interactive LLM systems, particularly those eschewing explicit speech recognition modules. The aggregate duration $T_{\text{total}} = \sum_{i=1}^{200{,}000} t_i$ in seconds is not specified, but per-utterance durations can be computed from the underlying audio files.

No internal splits for development or testing are present within InstructS2S-200K; all 200,000 examples are allocated for training purposes. Evaluation is performed on a separate set, InstructS2S-Eval, which derives from Alpaca-Eval and is not bundled with the dataset (Fang et al., 2024).

The core structural statistics are summarized as follows:

| Component | Value | Notes |
|---|---|---|
| Examples | 200,000 | All for training |
| Triplet form | Yes | ⟨speech-instruction, text-response, speech-response⟩ |
| Speech data | WAV | Clean, synthesized, studio-quality |

2. Data Sourcing, Instruction Rewriting, and Annotation

The instruction set is sourced from two existing datasets: 50,000 from Alpaca, representing general-purpose tasks, and 150,000 first-turn instructions from UltraChat, focused primarily on world-knowledge questions. This distribution yields a 25%/75% split of “general-purpose” versus “fact-question” prompts.

Each text instruction undergoes “speech-style” rewriting using Llama-3-70B-Instruct. This process involves:

  • Inserting natural conversational fillers such as “uh”, “um”, and “so”
  • Expanding numerals to their word equivalents
  • Shortening verbose queries

Llama-3-70B-Instruct also generates concise, speech-appropriate responses. Specific directives restrict output to linear utterances, disallowing lists or parentheses and requiring numbers to be expressed in words.
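
The response constraints above (no lists, no parentheses, numbers spelled out) lend themselves to a simple automated check. The following is a minimal sketch, not part of the released pipeline: the function name `speech_style_violations` and its three rule labels are hypothetical, and the list-marker pattern is only one plausible way to detect list formatting.

```python
import re

# Hypothetical checker (an assumption, not the authors' tooling): flags text
# that breaks the speech-style constraints described above.
_LIST_MARKER = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+", re.MULTILINE)


def speech_style_violations(text: str) -> list[str]:
    """Return the names of the constraints that `text` violates."""
    violations = []
    if _LIST_MARKER.search(text):        # a line starting with "-", "*", "•", "1.", "1)"
        violations.append("list")
    if "(" in text or ")" in text:       # parentheses are disallowed
        violations.append("parentheses")
    if re.search(r"\d", text):           # numbers must be written as words
        violations.append("digits")
    return violations
```

A check like this could be used to filter or regenerate responses that would not read naturally when synthesized.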

Speech synthesis operates via two separate TTS models:

  • CosyVoice-300M-SFT synthesizes instruction utterances, randomized between male and female voices.
  • VITS (trained on LJSpeech) synthesizes responses, always using a single “standard” synthetic voice.

All data is cleanly synthesized, with no background noise or channel perturbation. Audio is typically stored as WAV at a minimum sampling rate of 22,050 Hz, although the exact rate is inherited from the respective TTS models.

There are no human transcriptions; the text and speech are directly aligned, with text obtained from the synthetic prompt or reply used as TTS input.

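For speech decoder training, the speech responses are converted to discrete units: HuBERT extracts continuous features, K-means (with $K = 1000$) maps each frame to a cluster index, and consecutive repeated indices are merged. The resulting unit sequences serve as targets for CTC-based decoder training.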
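
The unit-extraction step can be illustrated with a small pure-Python sketch. It assumes that the K-means centroids are already fitted, that quantization means nearest-centroid assignment under Euclidean distance, and that collapsing means merging consecutive identical indices. The function names and the toy two-dimensional features are hypothetical; real HuBERT features are much higher-dimensional.

```python
from itertools import groupby


def nearest_centroid(frame, centroids):
    """Index of the centroid closest to `frame` (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((x - y) ** 2 for x, y in zip(frame, centroid))
    return min(range(len(centroids)), key=lambda k: sq_dist(centroids[k]))


def frames_to_units(frames, centroids):
    """Map each feature frame to a cluster index, then merge consecutive repeats."""
    raw_units = [nearest_centroid(frame, centroids) for frame in frames]
    # groupby yields one group per run of equal values, so keeping only the
    # keys removes consecutive duplicates while preserving order.
    return [unit for unit, _ in groupby(raw_units)]


# Toy example with 3 centroids instead of 1000.
centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
frames = [[0.1, 0.0], [0.2, -0.1], [0.9, 1.1], [4.8, 5.2], [5.1, 5.0], [0.0, 0.1]]
print(frames_to_units(frames, centroids))  # [0, 1, 2, 0]
```

Merging repeats shortens the target sequence without losing the order of distinct units, which is the property the decoder training relies on.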

3. Data Organization and File Structure

The dataset is organized under a canonical directory layout that includes a metadata file and a flat audio file structure. The metadata file, metadata.jsonl, encodes all necessary triplet information as one JSON object per line:

| Field | Description |
|---|---|
| instruct_id | Unique identifier |
| instruction_text | Speech-style text prompt |
| response_text | Speech-style text response |
| instruction_audio | Path to WAV file for instruction |
| response_audio | Path to WAV file for response |
| tts_voice_inst | TTS model and gender for instruction |
| tts_voice_resp | TTS model/gender for response |
| duration_inst_s | Duration (seconds) of instruction utterance |
| duration_resp_s | Duration (seconds) of response utterance |

The audio directory holds all instruction and response files as inst_{NNNNNN}.wav and resp_{NNNNNN}.wav, respectively, for all $1 \leq N \leq 200{,}000$. Users can extract utterance durations from audio file headers as needed.
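
Reading this layout could look like the sketch below. It assumes that NNNNNN denotes a six-digit zero-padded index and that the audio directory is named `audio`; neither detail is confirmed by the description above, and both helper names are hypothetical.

```python
import json
from pathlib import Path


def audio_paths(n: int, audio_dir: str = "audio") -> tuple[Path, Path]:
    """Build the instruction/response WAV paths for example number n (1-based)."""
    if not 1 <= n <= 200_000:
        raise ValueError(f"example number out of range: {n}")
    root = Path(audio_dir)  # assumed directory name
    return root / f"inst_{n:06d}.wav", root / f"resp_{n:06d}.wav"


def iter_metadata(lines):
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)
```

Because `iter_metadata` accepts any iterable of strings, it can be driven by an open file handle, for example `with open("metadata.jsonl", encoding="utf-8") as f: records = list(iter_metadata(f))`.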

4. Content Analysis and Speech Characteristics

Instruction categories are single-turn, open-ended, and do not include dialogue-style turns, code, or mathematics. Prompts are predominantly open-domain or world-knowledge questions, with the following approximate characteristics:

  • Instruction utterance length: Typically 5–12 spoken tokens (including fillers)
  • Response utterance length: Typically 6–15 spoken tokens
  • Speaker and voice diversity: All speakers are synthetic; instructions randomized between two CosyVoice voices (male/female), responses fixed to a single VITS voice.

The dataset has no accent, regional, or environmental diversity beyond what the TTS models natively provide. This exclusive reliance on synthetic voices results in highly uniform, noise-free, studio-quality recordings.

Average audio durations per utterance are not explicitly reported but can be computed by parsing the WAV audio headers. This suggests any derived timing statistics must be computed post-hoc by dataset users.
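
As a sketch of the header-based approach, Python's standard `wave` module exposes the frame count and sample rate, and their ratio gives the duration. This assumes the files are uncompressed PCM WAV, which `wave` can read; the helper name is hypothetical, and the self-check uses a generated in-memory file rather than dataset audio.

```python
import io
import wave


def wav_duration_seconds(source) -> float:
    """Duration of a PCM WAV file from its header: frame count / sample rate.

    `source` may be a path string or a binary file object.
    """
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


# Self-check on an in-memory file: 22,050 frames of silence at 22,050 Hz.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)           # mono
    wav.setsampwidth(2)           # 16-bit samples
    wav.setframerate(22050)
    wav.writeframes(b"\x00\x00" * 22050)
buffer.seek(0)
print(wav_duration_seconds(buffer))  # 1.0
```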

5. Speech Feature Extraction and Preprocessing

For alignment with the LLaMA-Omni architecture, recommended preprocessing and feature workflows are as follows:

  • Feature extraction: HuBERT features are quantized into discrete speech units via K-means ($K = 1000$). Consecutive repeated units are collapsed, which shortens the target sequences for CTC-based decoder training.
  • Frame downsampling: Input speech features are downsampled by a factor of 5 in the speech adaptor module.
  • No noise augmentation: Clean synthetic outputs are delivered as-is. The native release includes no augmentation, but end-users may add noise or speed perturbation for robustness.
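
The frame downsampling step can be sketched as follows. The text above only states the factor of 5; this example assumes the reduction is done by concatenating every five consecutive frames along the feature dimension and dropping any trailing partial group, which is one common way to implement such an adaptor. The function name is hypothetical.

```python
def downsample_by_concat(frames, k=5):
    """Concatenate every k consecutive frames into one longer frame.

    Assumption: a trailing group with fewer than k frames is dropped.
    """
    usable = len(frames) - len(frames) % k
    return [
        [value for frame in frames[i:i + k] for value in frame]
        for i in range(0, usable, k)
    ]


# 12 frames of dimension 2 become 2 frames of dimension 10.
frames = [[float(i), i + 0.5] for i in range(12)]
reduced = downsample_by_concat(frames)
print(len(reduced), len(reduced[0]))  # 2 10
```

Under this scheme the sequence length shrinks by the factor k while the per-frame dimension grows by the same factor, so no feature values are discarded within complete groups.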

A plausible implication is that models trained solely on this dataset may show limited robustness to real-world variability absent further augmentation or fine-tuning.

6. Licensing and Usage Considerations

The dataset is constructed for research on end-to-end speech-instruct models such as LLaMA-Omni, enabling joint speech understanding and generation tasks. Licensing status for the assembled dataset is not explicitly determined; users are advised to independently verify the terms for all components, including Alpaca (MIT-style), UltraChat (open-source), CosyVoice, VITS, and the HuBERT quantizer.

Practical dataset usage comprises loading metadata, mapping triplets to training input for LLMs capable of raw speech understanding/generation, and extracting utterance timing or duration as needed through standard audio parsers. Training on the full dataset is feasible with moderate computational resources; LLaMA-Omni was trained in under 3 days using 4 GPUs (Fang et al., 2024).

7. Statistical Metrics and Reference Formulas

Aggregate dataset duration $T_{\text{total}}$, mean utterance duration $\mu_{\text{dur}}$, and duration standard deviation $\sigma_{\text{dur}}$ can be calculated according to:

$$T_{\text{total}} = \sum_{i=1}^{N_{\text{total}}} t_i$$

$$\mu_{\text{dur}} = \frac{1}{N_{\text{total}}} \sum_{i=1}^{N_{\text{total}}} t_i$$

$$\sigma_{\text{dur}} = \sqrt{ \frac{1}{N_{\text{total}}} \sum_{i=1}^{N_{\text{total}}} (t_i - \mu_{\text{dur}})^2 }$$

where $N_{\text{total}} = 200{,}000$ and $t_i$ is the total duration of the $i$-th instruction+response pair in seconds.

Dataset users typically extract and process example durations directly from the file metadata and WAV headers prior to feeding the data into their speech-LLM pipelines. Further statistical analysis of utterance or instruction diversity is possible, conditional on parsing the existing metadata structures.
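
The three quantities defined in this section can be computed directly from a list of per-pair durations. The sketch below uses the population form of the standard deviation (division by the number of pairs), matching the formula given in this section; the function name and the sample values are illustrative only and are not dataset statistics.

```python
import math


def duration_stats(durations):
    """Return (total, mean, population standard deviation) in seconds."""
    n = len(durations)
    if n == 0:
        raise ValueError("need at least one duration")
    total = math.fsum(durations)          # T_total
    mean = total / n                      # mu_dur
    variance = math.fsum((t - mean) ** 2 for t in durations) / n
    return total, mean, math.sqrt(variance)  # sigma_dur


# Illustrative durations (seconds), not taken from the dataset.
print(duration_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))  # (40.0, 5.0, 2.0)
```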
