VoxBox: A Large-Scale Speech Dataset

Updated 7 December 2025
  • VoxBox is a large-scale speech dataset comprising over 100,000 hours of English and Mandarin recordings with rich demographic and paralinguistic annotations.
  • It aggregates data from 29 open-source corpora, covering studio, conversational, and emotional recordings with detailed metrics on utterance counts, age, and gender.
  • The dataset supports TTS research by providing fine-grained control over attributes like pitch, speaking rate, and emotion, enabling reproducible benchmarks for LLM-based systems.

VoxBox is a large-scale, attribute-rich speech dataset comprising over 100,000 hours of multi-domain English and Mandarin Chinese recordings, constructed to advance research in controllable text-to-speech (TTS), zero-shot voice cloning, and related speech language modeling tasks. The dataset is introduced to provide broad coverage of demographic attributes, fine-grained paralinguistic control, and reproducible, open benchmarks for LLM-based TTS systems such as Spark-TTS (Wang et al., 3 Mar 2025).

1. Dataset Composition

VoxBox aggregates 47,706,212 utterances from 29 open-source speech corpora, spanning English and Mandarin Chinese. The total duration sums to 102,500.26 hours (approximately 102.5k hours), with a stratified representation of gender and age. Major subsets include:

  • Chinese: Sources such as AISHELL-3 (88,035 utterances, 85.62 h), Emilia-CN (15,629,241 utterances, 34,759.45 h).
  • English: Sources such as CREMA-D, Dailytalk, Emilia-EN (8,303,103 utterances, 20,297.98 h).

The dataset encompasses clean studio recordings (e.g., Hi-Fi TTS, LibriSpeech), in-the-wild conversational speech (e.g., Dailytalk, MSP-Podcast), emotional corpora (e.g., CREMA-D, IEMOCAP, RAVDESS), and multi-domain text types (reading, dialogue, storytelling, spontaneous speech). Demographic coverage is supported by large corpora: AISHELL-3 (218 speakers), MAGICDATA (377 speakers), MLS-English (7,000+ speakers), and WenetSpeech4TTS (approximately 8,000 speakers), among others. Gender (male/female) and age (Child, Teenager, Young Adult, Middle-aged, Elderly) are explicitly annotated.

Language   Source      #Utterances    Male (h)   Female (h)   Total (h)
Chinese    AISHELL-3        88,035       16.01       69.61       85.62
Chinese    Emilia-CN    15,629,241   22,017.56   12,741.89   34,759.45
English    Dailytalk        23,754       10.79       10.86       21.65
English    Emilia-EN     8,303,103   13,724.76    6,573.22   20,297.98

(See full corpus listing and language-specific totals in (Wang et al., 3 Mar 2025), Table A.1.)
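As a quick consistency check, the male and female hours in each row should sum to the listed total; a minimal Python sketch over the four rows reproduced above (per-corpus figures from Table A.1):

```python
# Sanity-check the per-corpus rows reproduced above (values from Table A.1).
rows = [
    # (language, source, utterances, male_h, female_h, total_h)
    ("Chinese", "AISHELL-3",     88_035,     16.01,     69.61,     85.62),
    ("Chinese", "Emilia-CN", 15_629_241, 22_017.56, 12_741.89, 34_759.45),
    ("English", "Dailytalk",     23_754,     10.79,     10.86,     21.65),
    ("English", "Emilia-EN",  8_303_103, 13_724.76,  6_573.22, 20_297.98),
]

for lang, src, n_utt, male_h, female_h, total_h in rows:
    # Male + female hours should reproduce the listed total (within rounding).
    assert abs((male_h + female_h) - total_h) < 0.01, src

# Aggregate utterances and hours per language for the rows shown here.
per_lang = {}
for lang, _, n_utt, _, _, total_h in rows:
    utt, hrs = per_lang.get(lang, (0, 0.0))
    per_lang[lang] = (utt + n_utt, hrs + total_h)

for lang, (utt, hrs) in per_lang.items():
    print(f"{lang}: {utt:,} utterances, {hrs:,.2f} h")
```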

2. Annotation Schema and Attribute Definitions

VoxBox is annotated for a set of speaker and utterance-level attributes relevant to controllable speech synthesis workflows:

  • Gender: Male or female, obtained via fine-tuned WavLM-large; accuracy on AISHELL-3 test is 99.4%.
  • Age: Child, Teenager, Young Adult, Middle-aged, Elderly; WavLM-large fine-tuned for age yields 95.6% accuracy (Table A.2).
  • Pitch:
    • Fine-grained: mean F0 (rounded), on the Mel scale.
    • Coarse-grained: discretized into 5 levels (Very Low, Low, Moderate, High, Very High). Thresholds vary by gender: e.g., for males, Very Low < 145, Low 145–164, Moderate 164–211, High 211–250, Very High ≥ 250.
  • Speaking Rate:
    • Fine-grained: syllables per second (SPS), rounded.
    • Coarse-grained: discretized into five levels based on language-specific percentiles (e.g., English: Very Slow < 2.6, Slow 2.6–3.4, Moderate 3.4–4.8, Fast 4.8–5.5, Very Fast ≥ 5.5 SPS).
  • Emotion: A combination of the automatic emotion2vec label (with confidence), the SenseVoiceSmall label, and a text-based emotion class from Qwen2.5-72B (Fearful, Happy, Disgusted, Sad, Surprised, Angry, Neutral); a consensus sketch follows this list.
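The exact rule for combining the three emotion predictions is not stated in this summary; the following is a hypothetical majority-vote filter consistent with the quality-control criterion (Section 3) of rejecting samples without confident emotion predictions. Function and parameter names are illustrative, not the authors' API.

```python
# Hypothetical consensus filter over the three emotion predictions named above.
# Assumption: keep a sample only if the acoustic prediction is confident and
# at least two of the three labelers agree on one of the seven classes.
from collections import Counter

def consensus_emotion(e2v_label: str, e2v_conf: float,
                      sensevoice_label: str, qwen_label: str,
                      min_conf: float = 0.5) -> str | None:
    """Return the agreed emotion label, or None to reject the sample."""
    if e2v_conf < min_conf:                  # low-confidence acoustic label
        return None
    votes = Counter([e2v_label, sensevoice_label, qwen_label])
    label, count = votes.most_common(1)[0]
    return label if count >= 2 else None     # majority of three, else reject

print(consensus_emotion("Happy", 0.84, "Happy", "Neutral"))  # -> Happy
print(consensus_emotion("Sad", 0.30, "Sad", "Sad"))          # -> None (low conf.)
```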

Pitch and speaking-rate bands are defined separately for each gender and language, supporting fine-grained synthesis control and analysis of paralinguistic attributes in TTS. The sketch below illustrates binning with the thresholds quoted above.
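A minimal binning sketch, assuming the thresholds quoted above (male pitch on the Mel scale, English speaking rate in syllables per second) and right-closed bins; thresholds for other gender/language groups are percentile-derived in the same way but are not reproduced in this summary.

```python
import bisect

# Coarse 5-level binning of pitch and speaking rate from the quoted thresholds.
PITCH_LEVELS = ["Very Low", "Low", "Moderate", "High", "Very High"]
RATE_LEVELS = ["Very Slow", "Slow", "Moderate", "Fast", "Very Fast"]

PITCH_EDGES = {"male": [145, 164, 211, 250]}     # Mel-scale mean F0 (male only here)
RATE_EDGES = {"english": [2.6, 3.4, 4.8, 5.5]}   # syllables per second (English)

def coarse_pitch(mean_f0_mel: float, gender: str = "male") -> str:
    """Map a mean Mel-scale F0 to one of the five coarse pitch levels."""
    return PITCH_LEVELS[bisect.bisect_right(PITCH_EDGES[gender], mean_f0_mel)]

def coarse_rate(sps: float, language: str = "english") -> str:
    """Map syllables-per-second to one of the five coarse speaking-rate levels."""
    return RATE_LEVELS[bisect.bisect_right(RATE_EDGES[language], sps)]

print(coarse_pitch(200))  # -> "Moderate" (164 <= 200 < 211)
print(coarse_rate(5.0))   # -> "Fast"     (4.8 <= 5.0 < 5.5)
```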

3. Data Collection, Pre-processing, and Quality Control

All data are sourced from open, licensed repositories (OpenSLR, Kaggle, GitHub), spanning TTS, ASR, and emotion corpora. Pre-processing steps standardize the dataset:

  • Resampling: All files are resampled to 16 kHz mono WAV, 16-bit PCM.
  • Silence Trimming: Voice Activity Detection (VAD) removes leading/trailing silences so that speaking-rate estimates are accurate.
  • Transcript Cleaning: Noisy or ASR-derived corpora (the Emilia variants) are re-transcribed using FunASR or Whisper-large-v3; files whose new transcripts show insertion or deletion discrepancies against the reference transcripts are excluded.
  • Quality Control: Any sample failing transcript alignment (insertions or deletions) or lacking a confident emotion prediction is rejected.

No additional cross-corpus speaker de-duplication is imposed beyond the existing source dataset splits.
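The toolchain behind these steps is not spelled out in this summary; the following minimal sketch uses librosa/soundfile for audio standardization and difflib as a stand-in for the transcript alignment check, so treat it as an illustration rather than the authors' pipeline.

```python
import difflib
import librosa
import soundfile as sf

def standardize(in_path: str, out_path: str) -> None:
    """Resample to 16 kHz mono, trim edge silences, write 16-bit PCM WAV.
    librosa's energy-based trim stands in for the VAD used by the authors."""
    audio, _ = librosa.load(in_path, sr=16_000, mono=True)
    trimmed, _ = librosa.effects.trim(audio, top_db=40)
    sf.write(out_path, trimmed, 16_000, subtype="PCM_16")

def has_insertion_or_deletion(asr_words: list[str], gold_words: list[str]) -> bool:
    """Reject samples whose re-transcription adds or drops words relative to the
    reference transcript (substitutions alone are tolerated here; the paper's
    exact criterion may differ)."""
    ops = difflib.SequenceMatcher(a=gold_words, b=asr_words).get_opcodes()
    for tag, i1, i2, j1, j2 in ops:
        if tag in ("insert", "delete"):
            return True
        if tag == "replace" and (i2 - i1) != (j2 - j1):  # length-changing edit
            return True
    return False
```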

4. Dataset Splits, Availability, and Licensing

VoxBox is primarily constructed as the core training set for Spark-TTS. Where source datasets provide train/dev/test splits, these are retained; otherwise, the entire curated subset enters the training partition. All 29 source corpora provide permissive open-source licenses, with provenance, metadata, and download links supplied in Appendix A.3 of (Wang et al., 3 Mar 2025) and in the Spark-TTS GitHub repository. No further partitioning or de-duplication beyond upstream divisions is enforced.

5. Data Format, Codec, and Tokenization

Audio files in VoxBox are standardized to 16 kHz, 16-bit PCM, mono WAV format. The BiCodec tokenization scheme for Spark-TTS employs:

  • Semantic tokens: 50 tokens/s, codebook size K_s = 8192, i.e., b_s = 13 bits/token, for a bitrate R_semantic = 50 × 13 = 650 bps (0.65 kbps).
  • Global tokens: N_g = 32 per utterance, codebook size K_g = 4096, i.e., b_g = 12 bits/token; for an utterance of T_audio seconds, R_global ≈ 384 / T_audio bps, i.e., roughly 76.8 bps for a typical 5 s sample.

Each token sequence sent to Spark-TTS begins with fine-grained attribute tokens (if used), followed by the N_g = 32 global tokens, then the semantic token stream at 50 tokens/s, forming a single sequence from which both content and expressive attributes can be inferred. The sketch below reproduces the bitrate arithmetic.
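The bitrate figures follow directly from the codebook sizes; a short sketch reproducing the arithmetic (the durations are chosen only for illustration):

```python
from math import log2

# Reproduce the BiCodec bitrate figures from the codebook sizes given above.
K_s, rate_s = 8192, 50       # semantic codebook size, tokens per second
K_g, N_g = 4096, 32          # global codebook size, tokens per utterance

b_s = log2(K_s)              # 13 bits per semantic token
b_g = log2(K_g)              # 12 bits per global token

R_semantic = rate_s * b_s    # 50 * 13 = 650 bps (0.65 kbps)
global_bits = N_g * b_g      # 32 * 12 = 384 bits per utterance

for T in (5.0, 7.7, 20.0):   # example utterance durations in seconds
    print(f"T={T:>4} s: R_global = {global_bits / T:6.1f} bps, "
          f"total = {R_semantic + global_bits / T:6.1f} bps")
```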

6. Statistical Analysis and Distributions

Aggregate statistics and illustrative figures in (Wang et al., 3 Mar 2025) support quantitative analysis:

  • Mean Utterance Duration: μ_dur = 102,500 h ÷ (4.77 × 10^7 utterances) ≈ 7.7 s (see the worked computation after this list).
  • Gender and Age Distribution: Approximately 60% male / 40% female globally. Most utterances are from Young Adults, with fewer from Teenagers, Middle-aged, Children, and Elderly.
  • Histograms: Appendix Fig. A.1 displays histograms of speaking rate, duration, and pitch for English and Chinese, split by gender.
  • Variance: Variability of pitch and speaking rate can be directly inferred from the banding thresholds described in Section 2 (percentile-derived).
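Combining the mean duration with the BiCodec rates from Section 5 gives the typical token budget per utterance; a back-of-envelope sketch (treating the "typical" utterance as having the mean duration is an assumption):

```python
# Back-of-envelope sequence length for a mean-duration utterance, combining
# the aggregate statistics above with the BiCodec rates from Section 5.
total_hours = 102_500
utterances = 47_706_212

mean_dur_s = total_hours * 3600 / utterances   # ~7.7 s
semantic_tokens = 50 * mean_dur_s              # 50 tokens/s -> ~387 tokens
global_tokens = 32                             # fixed per utterance

print(f"mean duration : {mean_dur_s:.2f} s")
print(f"tokens / utt  : {semantic_tokens:.0f} semantic + {global_tokens} global")
```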

7. Significance for TTS and Speech Research

By assembling over 100k hours of multi-domain speech with detailed categorical and continuous attribute labeling—including gender, pitch, rate, age, and emotion—VoxBox establishes a benchmark foundation for reproducible, open research in controllable TTS. Its integration into Spark-TTS demonstrates the dataset’s role in facilitating zero-shot voice cloning and highly customizable speech synthesis, transcending the limitations of reference-based approaches (Wang et al., 3 Mar 2025). The rich metadata schema and alignment with foundation model architectures enable fine-grained controllability and systematic speech attribute conditioning for LLM-based speech generation frameworks.

References (1)

  1. Wang et al. (3 Mar 2025). Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens.
