VoxBox: A Large-Scale Speech Dataset
- VoxBox is a large-scale speech dataset comprising over 100,000 hours of English and Mandarin recordings with rich demographic and paralinguistic annotations.
- It aggregates data from 29 open-source corpora, covering studio, conversational, and emotional recordings with detailed metrics on utterance counts, age, and gender.
- The dataset supports TTS research by providing fine-grained control over attributes like pitch, speaking rate, and emotion, enabling reproducible benchmarks for LLM-based systems.
VoxBox is a large-scale, attribute-rich speech dataset comprising over 100,000 hours of multi-domain English and Mandarin Chinese recordings, constructed to advance research in controllable text-to-speech (TTS), zero-shot voice cloning, and related speech language modeling tasks. The dataset is introduced to provide broad coverage of demographic attributes, fine-grained paralinguistic control, and reproducible, open benchmarks for LLM-based TTS systems such as Spark-TTS (Wang et al., 3 Mar 2025).
1. Dataset Composition
VoxBox aggregates 47,706,212 utterances from 29 open-source speech corpora, spanning English and Mandarin Chinese. The total duration sums to 102,500.26 hours (approximately 102.5k hours), with a stratified representation of gender and age. Major subsets include:
- Chinese: Sources such as AISHELL-3 (88,035 utterances, 85.62 h), Emilia-CN (15,629,241 utterances, 34,759.45 h).
- English: Sources such as CREMA-D, Dailytalk, Emilia-EN (8,303,103 utterances, 20,297.98 h).
The dataset encompasses clean studio recordings (e.g., Hi-Fi TTS, LibriSpeech), in-the-wild conversational speech (e.g., Dailytalk, MSP-Podcast), emotional corpora (e.g., CREMA-D, IEMOCAP, RAVDESS), and multi-domain text types (reading, dialogue, storytelling, spontaneous speech). Demographic coverage is supported by large corpora: AISHELL-3 (218 speakers), MAGICDATA (377 speakers), MLS-English (7,000+ speakers), WenetSpeech4TTS (approximately 8,000 speakers), among others. Gender (male/female) and age (Child, Teenager, Young Adult, Middle-aged, Elderly) are explicitly annotated.
| Language | Source | #Utterances | Male (h) | Female (h) | Total (h) |
|---|---|---|---|---|---|
| Chinese | AISHELL-3 | 88,035 | 16.01 | 69.61 | 85.62 |
| Chinese | Emilia-CN | 15,629,241 | 22,017.56 | 12,741.89 | 34,759.45 |
| English | Dailytalk | 23,754 | 10.79 | 10.86 | 21.65 |
| English | Emilia-EN | 8,303,103 | 13,724.76 | 6,573.22 | 20,297.98 |
(See full corpus listing and language-specific totals in (Wang et al., 3 Mar 2025), Table A.1.)
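To illustrate how the per-corpus rows above roll up into language-level totals, here is a minimal Python sketch; the DataFrame columns mirror the table headers and are not the released metadata schema:

```python
import pandas as pd

# Per-corpus rows mirroring the excerpt above; the full listing is in
# Table A.1 of (Wang et al., 3 Mar 2025).
corpora = pd.DataFrame(
    [
        ("Chinese", "AISHELL-3",  88_035,     16.01,     69.61),
        ("Chinese", "Emilia-CN",  15_629_241, 22_017.56, 12_741.89),
        ("English", "Dailytalk",  23_754,     10.79,     10.86),
        ("English", "Emilia-EN",  8_303_103,  13_724.76, 6_573.22),
    ],
    columns=["language", "source", "utterances", "male_h", "female_h"],
)
corpora["total_h"] = corpora["male_h"] + corpora["female_h"]

# Language-level sums, analogous to the per-language totals in the paper.
print(corpora.groupby("language")[["utterances", "male_h", "female_h", "total_h"]].sum())
```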
2. Annotation Schema and Attribute Definitions
VoxBox is annotated for a set of speaker and utterance-level attributes relevant to controllable speech synthesis workflows:
- Gender: Male or female, predicted by a fine-tuned WavLM-large classifier; accuracy on the AISHELL-3 test set is 99.4%.
- Age: Child, Teenager, Young Adult, Middle-aged, Elderly; a WavLM-large model fine-tuned for age prediction yields 95.6% accuracy (Table A.2).
- Pitch:
- Fine-grained: Mean pitch per utterance (rounded), on the Mel scale.
- Coarse-grained: Discretized into five levels (Very Low, Low, Moderate, High, Very High). Thresholds vary by gender: e.g., for males, Very Low < 145, Low 145–164, Moderate 164–211, High 211–250, Very High > 250 (Mel).
- Speaking Rate:
- Fine-grained: Syllables per second (SPS), rounded.
- Coarse-grained: Five-level scale based on language-specific percentiles (e.g., English: Very Slow < 2.6, Slow 2.6–3.4, Moderate 3.4–4.8, Fast 4.8–5.5, Very Fast > 5.5 SPS).
- Emotion: Derived by combining an automatic emotion2vec label (with confidence), a SenseVoiceSmall label, and a text-based emotion class from Qwen2.5-72B (Fearful, Happy, Disgusted, Sad, Surprised, Angry, Neutral).
Pitch and speaking-rate bins are thus defined separately for each gender and language, supporting fine-grained synthesis and analysis of paralinguistic control in TTS; a minimal binning sketch follows below.
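The sketch below maps continuous values onto the coarse five-level bins using the male pitch thresholds (Mel) and English speaking-rate thresholds (SPS) quoted above; the helper name and use of bisect are illustrative, not the authors' implementation:

```python
import bisect

# Coarse 5-level binning mirroring the gender-/language-specific thresholds
# quoted in Section 2 (male pitch in Mel; English speaking rate in syllables
# per second). Other gender/language pairs use their own boundaries.
PITCH_BOUNDS_MALE = [145, 164, 211, 250]   # Mel
RATE_BOUNDS_EN = [2.6, 3.4, 4.8, 5.5]      # SPS
PITCH_LEVELS = ["Very Low", "Low", "Moderate", "High", "Very High"]
RATE_LEVELS = ["Very Slow", "Slow", "Moderate", "Fast", "Very Fast"]

def coarse_level(value: float, bounds: list[float], labels: list[str]) -> str:
    """Map a continuous value to one of five coarse labels."""
    return labels[bisect.bisect_right(bounds, value)]

print(coarse_level(200.0, PITCH_BOUNDS_MALE, PITCH_LEVELS))  # -> "Moderate"
print(coarse_level(5.0, RATE_BOUNDS_EN, RATE_LEVELS))        # -> "Fast"
```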
3. Data Collection, Pre-processing, and Quality Control
All data are sourced from open, licensed repositories (Openslr, Kaggle, GitHub), spanning TTS, ASR, and emotion corpora. Pre-processing steps standardize the dataset:
- Resampling: All files resampled to 16 kHz, WAV (mono), 16-bit PCM.
- Silence Trimming: Voice Activity Detection (VAD) removes leading/trailing silence so that speaking-rate estimates are accurate.
- Transcript Cleaning: Audio from noisy or ASR-oriented corpora (the Emilia variants) is re-transcribed using FunASR or Whisper-large-v3; files whose new transcripts show insertion/deletion discrepancies against the original transcripts are excluded.
- Quality Control: Any sample failing transcript alignment (insertion, deletion) or lacking confident emotion predictions is rejected.
No additional cross-corpus speaker de-duplication is imposed, beyond existing source dataset splits.
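A minimal sketch of the resampling and trimming steps, assuming librosa and soundfile are available; librosa's energy-based trim stands in for the VAD model the pipeline actually uses, and the file paths and top_db threshold are illustrative:

```python
import librosa
import soundfile as sf

def preprocess(in_path: str, out_path: str) -> None:
    """Resample to 16 kHz mono and trim leading/trailing silence."""
    # librosa loads as mono float32 and resamples in one call.
    audio, sr = librosa.load(in_path, sr=16_000, mono=True)
    # Energy-based trim; the paper's pipeline uses a VAD model here.
    trimmed, _ = librosa.effects.trim(audio, top_db=40)
    # Write 16-bit PCM WAV, matching the standardized format.
    sf.write(out_path, trimmed, sr, subtype="PCM_16")

preprocess("raw/sample.wav", "clean/sample.wav")
```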
4. Dataset Splits, Availability, and Licensing
VoxBox is primarily constructed as the core training set for Spark-TTS. Where source datasets provide train/dev/test splits, these are retained; otherwise, the entire curated subset enters the training partition. All 29 source corpora provide permissive open-source licenses, with provenance, metadata, and download links supplied in Appendix A.3 of (Wang et al., 3 Mar 2025) and in the Spark-TTS GitHub repository. No further partitioning or de-duplication beyond upstream divisions is enforced.
5. Data Format, Codec, and Tokenization
Audio files in VoxBox are standardized to 16 kHz, 16-bit PCM WAV format. The BiCodec tokenization scheme for Spark-TTS employs:
- Semantic tokens: 50 tokens/sec; with codebook size $V_s$, each token carries $\log_2 V_s$ bits, giving a bitrate of $50 \log_2 V_s$ bps.
- Global tokens: a fixed count $N_g$ per utterance with codebook size $V_g$ ($\log_2 V_g$ bits/token); for an utterance of $T$ seconds, the amortized bitrate is $N_g \log_2 V_g / T$, approximately 7.7 bps for a typical 5 s sample.
Each token sequence sent to Spark-TTS begins with fine-grained attribute tokens (if used), followed by global tokens, then a continuous semantic token sequence at 50 TPS, forming a coherent set suitable for inferring both content and expressive attributes.
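To make the bitrate arithmetic explicit, here is a small helper computing bits per second from token rate and codebook size; the 8192-entry codebook in the example is a hypothetical value, since the concrete BiCodec configuration is given in (Wang et al., 3 Mar 2025):

```python
import math

def stream_bitrate(tokens_per_second: float, codebook_size: int) -> float:
    """Bits per second of a continuous token stream: rate * log2(|codebook|)."""
    return tokens_per_second * math.log2(codebook_size)

def global_bitrate(n_tokens: int, codebook_size: int, duration_s: float) -> float:
    """Amortized bits/sec of fixed-length per-utterance (global) tokens."""
    return n_tokens * math.log2(codebook_size) / duration_s

# Hypothetical codebook size for illustration only.
print(stream_bitrate(50, 8192))  # 50 TPS * 13 bits/token = 650.0 bps

# The quoted ~7.7 bps over a 5 s utterance implies the global tokens carry
# roughly 38.5 bits in total (n_tokens * log2(codebook_size) ~ 38.5).
```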
6. Statistical Analysis and Distributions
Aggregate statistics and illustrative figures in (Wang et al., 3 Mar 2025) support quantitative analysis:
- Mean Utterance Duration: reported in (Wang et al., 3 Mar 2025).
- Gender and Age Distribution: Approximately 60% male / 40% female globally. Most utterances are from Young Adults, with fewer from Teenagers, Middle-aged, Children, and Elderly.
- Histograms: Appendix Fig. A.1 displays histograms of speaking rate, duration, and pitch for English and Chinese, split by gender.
- Variance: Variability of pitch and speaking rate can be directly inferred from the banding thresholds described in Section 2 (percentile-derived).
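A short sketch of how the reported gender and age proportions could be recomputed from per-utterance metadata; the record fields are assumptions about the schema, not the released format:

```python
from collections import Counter

# Hypothetical per-utterance metadata records; the released metadata
# schema may differ.
utterances = [
    {"gender": "male", "age": "Young Adult"},
    {"gender": "female", "age": "Young Adult"},
    {"gender": "male", "age": "Middle-aged"},
]

gender_counts = Counter(u["gender"] for u in utterances)
age_counts = Counter(u["age"] for u in utterances)

total = sum(gender_counts.values())
for gender, n in gender_counts.items():
    print(f"{gender}: {100 * n / total:.1f}%")
print(age_counts.most_common())
```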
7. Significance for TTS and Speech Research
By assembling over 100k hours of multi-domain speech with detailed categorical and continuous attribute labeling—including gender, pitch, rate, age, and emotion—VoxBox establishes a benchmark foundation for reproducible, open research in controllable TTS. Its integration into Spark-TTS demonstrates the dataset’s role in facilitating zero-shot voice cloning and highly customizable speech synthesis, transcending the limitations of reference-based approaches (Wang et al., 3 Mar 2025). The rich metadata schema and alignment with foundation model architectures enable fine-grained controllability and systematic speech attribute conditioning for LLM-based speech generation frameworks.